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Abstract 

Learning an input-output mapping from a set of examples, of the type that many 
neural networks have been constructed to perform, can be regarded as synthesizing 
an approximation of a multi- dimensional function, that is solving the problem of hy- 
persurface reconstruction. From this point of view, this form of learning is closely 
related to classical approximation techniques, such as generalized splines and regular- 
ization theory. This paper considers the problems of an exact representation and, in 
more detail, of the approximation of linear and nonlinear mappings in terms of sim- 
pler functions of fewer variables. Kolmogorov's theorem concerning the representation 
of functions of several variables in terms of functions of one variable turns out to be 
almost irrelevant in the context of networks for learning. We develop a theoretical 
framework for approximation based on regularization techniques that leads to a class 
of three-layer networks that we call Generalized Radial Basis Functions (GRBF), since 
they are mathematically related to the well-known Radial Basis Functions, mainly used 
for strict interpolation tasks. GRBF networks are not only equivalent to generalized 
splines, but are also closely related to pattern recognition methods such as Parzen 
windows and potential functions and to several neural network algorithms, such as 
Kanerva's associative memory, backpropagation and Kohonen's topology preserving 
map. They also have an interesting interpretation in terms of prototypes that are 
synthesized and optimally combined during the learning stage. The paper introduces 
several extensions and applications of the technique and discusses intriguing analogies 
with neurobiological data. 
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1 Learning as Approximation 

The problem of learning a mapping between an input and an output space is essentially 
equivalent to the problem of synthesizing an associative memory that retrieves the appropri- 
ate output when presented with the input and generalizes when presented with new inputs. 
It is also equivalent to the problem of estimating the system that transforms inputs into 
outputs given a set of examples of input-output pairs. A classical framework for this prob- 
lem is approximation theory. Related fields are system identification techniques when it is 
possible to choose the input set and system estimation techniques when the input-output 
pairs are given. A suggestive point of view on networks and classical representation and 
approximation methods is provided by Omohundro (1987) and an interesting review of net- 
works, statistical inference, and estimation techniques can be found in Barron and Barron, 
1988. Learning from the point of view of approximation has been also considered among 
others by J. Schwartz (1988), Poggio et al. (1988, 1989), Aloimonos (1989), Hurlbert and 
Poggio (1988) and Poggio (1975). 

Approximation theory deals with the problem of approximating or interpolating a contin- 
uous, multivariate function f(X) by an approximating function F(W, X) having a fixed num- 
ber of parameters W (X and W are real vectors X — x\, x 2 , ..., x n and W = Wi,w 2 , ..., w m ). 
For a choice of a specific i* 1 , the problem is then to find the set of parameters W that pro- 
vides the best possible approximation of / on the set of "examples". This is the learning 
step. Needless to say, it is very important to choose an approximating function F that can 
represent / as well as possible. There would be little point in trying to learn, if the chosen 
approximation function F(W,X) could only give a very poor representation of f(X), even 
with optimal parameter values. Therefore, it is useful to separate three main problems: 

1) the problem of which approximation to use, i.e. which classes of functions f(X) can be 
effectively approximated by which approximating functions F(W,X). This is a representa- 
tion problem. Our motivation is similar to Minsky's and Papert's in studying the properties 
and limitations of Perceptrons to deal with specific classes of problems. The issue of complex- 
ity of the approximation arises naturally at this point. The complexity of the approximation 
- measured, for instance, by number of terms, is directly related to the scaling problem of the 
neural network literature (Rumelhart et al., 1986), to the concept of order ', a central point 
in Perceptrons and to the curse of dimensionality, well-known in statistical estimation. 

2) the problem of which algorithm to use for finding the optimal values of the parameters 
W for a given choice of F. 

3) the problem of an efficient implementation of the algorithm in parallel, possibly analog 
hardware. 

This paper deals with the first two of these problems and is especially focused on the first. 

1.1 Networks and Approximation Schemes 

Almost all approximation schemes can be mapped into some kind of network that can be 
dubbed as a "neural network" 1 . Networks, after all, can be regarded as a graphic notation 
for a large class of algorithms. In the context of our discussion, a network is a function 



1 Many instances of Neural Networks should be called Non-Neural-Networks, since their relation to bio- 
logical neurons is weak at best 
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represented by the composition of many basic functions. To see how the approximation 
problem maps into a network formulation, let us introduce some definitions. 

To measure the quality of the approximation, one introduces a distance function p to 
determine the distance p[f(X),F(W,X)] of an approximation F(W,X) from f(X). The 
distance is usually induced by a norm, for instance the standard L 2 norm. The approxima- 
tion problem can be then stated formally as: 

Approximation problem If f(X) is a continuous function defined on set X ; and 
F(W,X) is an approximating function that depends continuously on W £ P and X, the 
approximation problem is to determine the parameters W* such that 

p[F(W\X)J(X)} < p[F{W,X),f{X)] 
for all W in the set P . 

With these definitions we can consider a few examples of F(W,X), shown in the figure 

• the classical linear case is 

F(W,X) = W-X 

where W is an m X n matrix and X is an n-dimensional vector. It corresponds to a 
network without hidden units; 

• the classical approximation scheme is linear in a suitable basis of functions <$>i(X) of 
the original inputs X, that is 

F(W,X) = W-$i(X) 

and corresponds to a network with one layer of hidden units. Spline interpolation and 
many approximation schemes, such as expansions in series of orthogonal polynomials, 
are included in this representation. When the $; are products and powers of the input 
components JQ, F is a polynomial. 

• the nested sigmoids scheme (usually called backpropagation, BP in short) can be writ- 
ten as 

F(W,X) = a(£w n a(£v i a(...a(£u j X j )...))) 

n i j 

and corresponds to a multilayer network of units that sum their inputs with "weights" 
w, v, u, . . . and then perform a sigmoidal transformation of this sum. This scheme (of 
nested nonlinear functions) is unusual in the classical theory of the approximation of 
continuous functions. Its motivation is that 



F(W,X) = vQ2w n v(E u i X i)) 

n 3 

with a being a linear threshold function, can represent all Boolean functions (any 
mapping S from / = {0, 1}^ into {0, 1} can be written as a disjunction of conjunctions, 
which in terms of threshold elements becomes the above expression, where biases or 
dummy inputs are allowed). Networks of this type, with one layer of hidden units, can 
approximate uniformly any continuous <f-variate functions (see, for instance, Cybenko, 
1989 or Funahashi, 1989; Cybenko, 1988 and Moore and Poggio, 1988, among others, 
proved the same result for the case of two layers of hidden units). 

In general, each approximation scheme has some specific algorithm for finding the optimal 
set of parameters W. An approach that works in general, though it may not be the most 
efficient in any specific case, is some relaxation method, such as gradient descent or conjugate 
gradient or simulated annealing, in parameter space, attempting to minimize the error p 
over the set of examples. In any case, our discussion suggests that networks of the type 
used recently for simple learning tasks can be considered as specific methods of function 
approximation. This observation suggests that we approach the problem of learning from 
the point of view of classical approximation theory. 

In this paper, we will be mainly concerned with the first of the problems listed earlier, 
that is the problem of developing a well-founded and sufficiently general approximation 
scheme, which maps into multilayer networks. We will only touch upon the second problem 
of characterizing efficient "learning" algorithms for estimating parameters from the data. 

The plan of the paper is as follows. We first consider the question of whether exact, instead 
of approximated, representations are possible for a large class of functions, since a theorem of 
Kolmogorov has been sometime interpreted as supporting this claim. We conclude that exact 
representations with the required properties do not exist. Good and general approximating 
representations, however, may exist. Thus section 3 discusses the formulation of the problem 
of learning from examples as the problem of approximation of mappings and, in particular, 
as hypersurface reconstruction. From this point of view, regularization techniques used for 
surface reconstruction can be applied also to the problem of learning. The problem of the 
connection between regularization techniques and feedforward, multilayer networks is left 
open. Sections 4 and 5 provide an answer to this question by showing that regularization 
leads to an approximation scheme, called Generalized Radial Basis Functions (GRBFs), 
which is general, powerful and maps into a class of networks with one layer of hidden units. 
We show that GRBFs are mathematically strictly related to the well-known interpolation 
method of Radial Basis Functions. Section 4 reviews some of the existing results about RBF, 
while Section 5 derives the main result of this paper, that is the derivation of GRBFs from 
regularization. In section 6 we discuss how the framework of GRBFs encompasses several 
existing "neural network" schemes. The possible relevance of the work to neurophysiology 
is then briefly outlined, together with a number of interesting properties of gaussian radial 
basis functions. Section 8 discusses several extensions and applications of the method. We 
conclude with some comments on the crucial problem of dimensionality faced by this and 
any other learning or approximation technique 2 . 



2 A preliminary version of the ideas developed here have appeared in (Poggio and the staff, 1989). 






Figure 1: (a) A linear approximating function maps into a network without a hidden layer 
with linear units. The "weights" of the connections correspond to the matrix W - the linear 
estimator, (b) Polynomial estimators and other linear combinations of nonlinear "features" 
of the input correspond to networks with one hidden layer. In the case of polynomials the 
hidden units correspond to products (II) and powers of the components of the input vector. 
The first layer of connections is fixed; the second is modified during "training", (c) A back- 
propagation network with one layer of hidden sigmoid units. Both sets of connections - 
from the input layer to the hidden units (wi) and from there to the output layer (w 2 ) - are 
modified during training. 



2 Kolmogorov's Theorem: An Exact Representation 
is Hopeless 

Before discussing more extensively the approximation problem, it is obviously important to 
answer the question of whether an exact representation exists for continuous functions in 
terms of simpler functions. For instance, if all multivariate functions could be represented 
exactly and nicely as the sum or product of univariate ones, we could use networks consisting 
of units with just one input and one output. Recently, it has been claimed that a theorem 
of this type, due to Kolmogorov (1957), could be used to justify the use of multilayer net- 
works (Hecht-Nielsen, 1987; Barron and Barron, 1988; see also Poggio, 1982). The original 
statement (Lorentz, 1976) is the following: 

Theorem 2.1 (Kolmogorov, 1957) There exist fixed increasing continuous functions h vq (x 
on I — [0, 1] so that each continuous function f on I n can be written in the form 

2n+l n 

f(x U ...,X n )= J2 3 q (J2 h Pq( X p))i 
q = l p=l 

where g q are properly chosen continuous functions of one variable. 

This result asserts that every multivariate continuous function can be represented by the 
superposition of a small number of univariate continuous functions. In terms of networks 
this means that every continuous function of many variables can be computed by a network 
with two hidden layers whose hidden units compute continuous functions (the functions g q 
and h pq ). Can this be considered as a proof that a network with two hidden layers is a good 
and powerful representation? The answer is no. There are at least two reasons for this. 

First, in a network implementation that has to be used for learning and generalization, 
some degree of smoothness is required for the functions corresponding to the units in the 
network. Smoothness of the h q and of g is important because the representation must be 
smooth in order to generalize and be stable against noise. A number of results of Vituskin 
(1954, 1977) and Henkin (1967) show, however, that the inner functions h pq of Kolmogorov's 
theorem are highly nonsmooth (they can be regarded as "hashing" functions, see Appendix 
A and Abelson, 1978). In fact Vitushkin (1954) had proved that there are functions of more 
than one variable which are not representable as the superposition of differentiable functions 
of one variable. Little seems to be known about the smoothness of g. Kahane (1975) shows 
that g can be represented as an absolutely convergent Fourier series. It seems that g could 
be either smooth or non-smooth, even for differentiable functions /. 

The second reason is that useful representations for approximation and learning are 
parametrized: they correspond to networks with fixed units and modifiable parameters. Kol- 
mogorov 's network is not of this type: the form of g depends on the specific function / to 
be represented (the h q are independent), g is at least as complex, in terms of bits needed to 
represent it, as /. 

A stable and usable exact representation of a function in terms of networks with two or 
more layers seems hopeless. The result obtained by Kolmogorov can then be considered as 
a "pathology" of the continuous functions. Vitushkin's results however, leave completely 



open the possibility that multilayer approximations exist. In fact it has been recently proved 
(see, for instance, Funahashi, 1989; Moore and Poggio, 1988) that a network with two layers 
of hidden sigmoidal units can approximate arbitrarily well any continuous function. It is 
interesting to notice that this statement still holds true if there is just one hidden layer 
(Carrol and Dickinson, 1989; Cybenko, 1989; Funahashi, 1989). The expansion in terms of 
sigmoid functions can then be regarded as one of the possible choices for the representation 
of a function, although little is known about its properties. The problem of finding good 
and well founded approximate representations will be considered in the rest of the paper. 

3 Learning as Hypersurface Reconstruction 

If we consider learning from the perspective of approximation, we can draw an equivalence 
between learning smooth mappings and a standard approximation problem, surface recon- 
struction from sparse data points. In this analogy, learning simply means collecting the 
examples, i.e., the input coordinates Xi,]ji and the corresponding output values at those 
locations, the height of the surface d{. This builds a look-up table. Generalization means 
estimating d in locations x,y where there are no examples, i.e. no data. This requires 
interpolating or, more generally, approximating the surface between the data points. Inter- 
polation is the limit of approximation when there is no noise in the data. This example, 
given for a surface, i.e., the graph in R 2 X i?, corresponding to the mapping from R 2 to 
i?, can be immediately extended to mappings from R n to R m (and graphs in R n X R m ). 
In this sense learning is a problem of hypersurface reconstruction. Notice that the other 
tasks of classification and of learning boolean functions may be regarded in a similar way. 
They correspond to the problems of approximating a mapping R n -^ {0, 1} and a mapping 
{0, l} n — > {0, 1}, respectively. 

3.1 Approximation, Regularization, and Generalized Splines 

From the point of view of learning as approximation, the problem of learning a smooth 
mapping from examples is ill-posed (Courant and Hilbert, 1962; Hadamard, 1964; Tikhonov 
and Arsenin, 1977) in the sense that the information in the data is not sufficient to reconstruct 
uniquely the mapping in regions where data are not available. In addition, the data are 
usually noisy. A priori assumptions about the mapping are needed to make the problem 
well-posed. Generalization is not possible if the mapping is completely random. For instance, 
examples of the mapping represented by a telephone directory (people's names into telephone 
numbers) do not help in estimating the telephone number corresponding to a new name. 
Generalization is based on the fact that the world in which we live is usually - at the 
appropriate level of description - redundant. In particular, it may be smooth: small changes 
in some input parameters determine a correspondingly small change in the output (it may be 
necessary in some cases to accept piecewise smoothness). This is one of the most general and 
weakest constraints that makes approximation possible. Other, stronger a priori constraints 
may be known before approximating a mapping, for instance that the mapping is linear, or 
has a positive range, or a limited domain or is invariant to some group of transformations. 
Smoothness of a function corresponds to the function being not fully local: the value at one 



point depends on other values nearby. The results of Stone (1982, see section 9.2) suggest 
that if nothing else is known about the function to be approximated with dimensions, say, 
higher than about 10, the only option is to assume a high degree of smoothness. If the 
function to be approximated is not sufficiently smooth, the number of examples required 
would be totally unpractical. 

3.2 Regularization Techniques for Learning 

Techniques that exploit smoothness constraints in approximation problems are well known 
under the term of standard regularization. Consider the inverse problem of finding the 
hypersurface values z, given sparse data d. Standard regularization replaces the problem 
with the variational problem of finding the surface that minimizes a cost functional consisting 
of two terms (Tikhonov, 1963; Tikhonov and Arsenin, 1977; Morozov, 1984; Bertero, 1986; 
the first to introduce this technique in computer vision was Eric Grimson, 1981). The first 
term measures the distance between the data and the desired solution z\ the second term 
measures the cost associated with a functional of the solution ||P^|| that embeds the a priori 
information on z. P is usually a differential operator. In detail, the problem is to find the 
hypersurface z that minimizes 

j2( Zt -d t ) 2 + \\\Pz\\ 2 (i) 

i 

where i is a collective index representing the points in feature space where data are avail- 
able and A, the regularization parameter, controls the compromise between the degree of 
smoothness of the solution and its closeness to the data. Therefore A is directly related to 
the degree of generalization that is enforced. It is well known that standard regularization 
provides solutions that are equivalent to generalized splines (Bertero et al., 1988). A large 
body of results in fitting and approximating with splines may be exploited. 

3.3 Learning, Bayes Theorem and Minimum Length Principle 

The formulation of the learning problem in terms of regularization is satisfying from a the- 
oretical point of view. A variational principle such as equation (1) can be solidly grounded 
on Bayesian estimation (see Appendix D). Using Bayes theorem one expresses the condi- 
tional probability distribution P z /d(z] d) of the hypersurface z given the examples d in terms 
of a prior probability P z (z) that embeds the constraint of smoothness and the conditional 
probability Pd/ Z (d] z) of d given z, equivalent to a model of the noise: 

P z / d (z; d) oc P z (z) P d / Z (d] z) . 
This can be rewritten in terms of complexities of hypothesis, defined as C(-) — — log P(-) 

C(z\d) = C(z) + C(d\z) + c (2) 

where c, which is related to Pd(d), depends only on d. The MAP estimate corresponds to 
considering the z with minimum complexity C(z\d). Maximum likelihood is the special case 
of MAP for uniform C(z) (perfect a priori ignorance). 



The maximum of this posterior probability (the MAP estimate) coincides with standard 
regularization, that is equation (1), provided that the noise is additive and gaussian and 
the prior is a gaussian distribution of a linear functional of z (see Appendix D). Under 
these conditions, the first term - J2i \\ z i ~ di\\ 2 - in the regularization principle equation 1 
corresponds to C(d\z), whereas the second term - ||P^|| 2 - corresponds to the prior C(z). 

Outside the domain of standard regularization, the prior probability distribution may 
represent other a priori knowledge than just smoothness. Piecewise constancy, for instance, 
could be used for classification tasks. Positivity, convexity, local behaviors of various types 
may be captured by an appropriate prior. Markov Random Field models, which can be 
considered as an extension of regularization, allow more flexibility in the underlying gener- 
alization conditions, for instance in terms of piecewise smoothness, by using line processes 
(see Geman and Geman, 1985 and Marroquin et al., 1987). 

Notice that in practice additional a priori information must be supplied in order to 
make the learning problem manageable. Space invariance or other invariances to appropri- 
ate groups of transformations can play a very important role in effectively countering the 
dimensionality problem (see Poggio, 1982). 

As pointed out by Rivest (in preparation), one can reverse the relationship between prior 
probabilities and complexity (see equation (2)). Instead of determining the complexity C(z) 
in equation 2 from the prior, one may measure the complexity of the a priori hypotheses 
to determine the prior probabilities. Rissanen (1978), for instance, proposes to measure 
the complexity of a hypothesis in terms of the bit length needed to encode it. In this 
sense, the MAP estimate is equivalent to the Minimum Description Length Principle: the 
hypothesis z which for given d can be described in the most compact way is chosen as the 
"best" hypothesis. Similar ideas have been explored by others (for instance Solomonoff, 
1978). They connect data compression and coding with Bayesian inference, regularization, 
hypersurface reconstruction and learning. 

3.4 From Hypersurface Reconstruction to Networks 

In the section above we have sketched the strict relations between learning, Bayes estimation, 
MRFs, regularization and splines; splines are equivalent to standard regularization, itself a 
special case of MRF models, which are a subset of Bayesian estimators. All these methods 
can be implemented in terms of parallel networks: in particular, we have argued that MRFs 
can be implemented in terms of hybrid networks of coupled analog and digital elements 
(Marroquin et al., 1987). Standard regularization can be implemented by resistive grids, 
and has been implemented on an analog VLSI chip (Harris, 1989). It is then natural to ask 
if splines, and more generally standard regularization, can be implemented by feedforward 
multilayer networks. The answer is positive, and will be given in the next few sections 
in terms of what we call Generalized Radial Basis Functions (GRBF). GRBFs are closely 
related to an interpolation technique called Radial Basis Functions (RBF), which has recent 
theoretical foundations (see Powell, 1987 for a review) and has been used with very promising 
results (Hardy, 1971; Franke, 1982; Rippa, 1984; Broomhead and Lowe, 1988; Renals and 
Rohwer, 1989; Casdagli, 1989) . 
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4 Radial Basis Functions: A Review 

In the following sections we will describe the RBF technique, its feedforward network im- 
plementation and a straightforward extension that makes it usable for approximation rather 
than for interpolation. 

4.1 The interpolation problem and RBF 

The Radial Basis Function (RBF) method is one of the possible solutions to the real multi- 
variate interpolation problem, that can be stated as follows: 

Interpolation problem given TV different points {x; £ R n \i = 1,...TV} and TV real 
numbers {yi £ R\i = 1,...TV} find a function F from R n to R satisfying the interpolation 
conditions: 

F(xi) = yi i = l,...,N. 

The RBF approach consists in choosing F from a linear space of dimension TV, depending 
on the data points {x^}. The basis of this space is chosen to be the set of functions 

{MI|x-x,-||)|i = l,..JV} 

where h is a continuous function from i? + to R, usually called the radial basis function, and 
|| • || is the Euclidean norm on R n . Usually a polynomial is added to this basis, so that the 
solution to the interpolation problem has the following form: 

N m 

F(x) = ]>>/*( ||x - x 8 1| ) + ]>>^(x) m<n (3) 

i=l i=l 

where {pi\i = l,...,m} is a basis of the linear space 7Tk-i(R n ) of algebraic polynomials of 
degree at most h — 1 from R n to i?, and h is given. 

The interpolation conditions give TV linear equations for the (TV + m) coefficients q and d{ 
in equation (3), so that the remaining degrees of freedom are fixed by imposing the following 
constraints: 



N 

Y J c t p J (^i) = 0, j = l,...,m. 

i=l 



In order to discuss the solubility of the interpolation problem by means of this representation 
we need the following definition (Gelfand and Vilenkin, 1964; Micchelli, 1986): 

Definition 4.1 A continuous function f{t) ) defined on [0,oo) ; is said to be conditionally 
(strictly) positive definite of order h on R n if for any distinct points Xi,...,Xjv £ R n 
and scalars Ci, . . . , Cjv such that J2i=i c iP(^i) — for all p £ 7Tk-i(R n ), the quadratic form 
J2i=i J2 1= i c i c jf(\\*-i ~ x j||) Z5 (positive) nonnegative. 
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Notice that for k = this class of functions, that we denote by Vk(K a )^ reduces to the 
class of the (strictly) positive definite functions, that is the class of functions such that the 
quadratic form J2i=i J2 1= i c i c jf{\\^i ~ x j||) i s (positive) nonnegative (Schoenberg, 1938). 

Well known results of approximation theory assert that a sufficient condition for the 
existence of a solution of the form (3) to the interpolation problem is that h £ Vk(K a )- It 
is then an important problem to give a full characterization of this class. In particular we 
are interested in characterizing the set of functions that are conditionally positive definite of 
order k over any i? n , that we define as simply Vk- 

4.1.1 RBF and Positive Definite Functions 

The class Vo has been extensively studied (see Stewart, 1976, for a review), and we mention 
here one relevant result obtained by Schoenberg in 1938. Before stating his result we first 
give the following 



Definition 4.2 A function f is said to be completely monotonic on (0, oo) provided that 

dx 



it is C°°(0,oo) and ( — l) l ~[(x) > ; \/x £ (0,oo) ; V/ £ Af, where N is the set of natural 



numbers. 

We define Mi as the set of all the functions that are completely monotonic on (0,oo). In 
1938 Schoenberg was able to show the deep connection between Mi and Vo- In fact he 
proved the following theorem: 

Theorem 4.1 (Schoenberg, 1938) A function f(r) is completely monotonic on (0, oo) if 
and only if f(r 2 ) is positive definite. 

This theorem asserts that the classes Mi and Vo are the same class, but Schoenberg went 
further, proving that in his theorem positive definitess can be replaced by strictly positive 
definitess, except for trivial cases. We can then conclude that it is possible to solve the 
interpolation problem with an expansion of the type 

N 

^(x) = 5>MI|x-x,-||) (4) 

i=l 

if the function h(yt) is completely monotonic. The unknown coefficients q. can be recovered 
imposing the interpolation conditions i^(xj) = y 3 (j = 1, ...TV), that substituted in equation 
(4) yields the linear system 

N 



y 3 =X)c."Mll x i - X *ll) j = l,-,^- 



i=l 



Defining the vectors y, c and the symmetric matrix H as follows 

(y)i = vh ( c )« = c ^ ( H )ij = MII x j - x < 

the coefficients of the expansion (4) are given by 



H-'y. (5) 
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The theorem of Schoenberg ensures that the solution of system (5) always exists, since the 
matrix H can be inverted, being strictly positive definite. As an example of application of 
the theorem of Schoenberg we mention the functions e~ r and (1 + r)~ a with a > 0: since 
they are evidently completely monotonic the functions e~ r (Gaussians) and (1 + r 2 )~ a are 
strictly positive definite, and can be used as radial basis functions to interpolate any set of 
n-dimensional data points. 

From equation (5) it turns out that a necessary and sufficient condition to solve the 
interpolation problem is the invertibility of the matrix H. Schoenberg's theorem, however, 
gives only a sufficient condition, so that many other functions could be used as radial basis 
functions without being strictly positive definite. Other sufficient conditions have been 
recently given by Micchelli, that in 1986 proved the following theorem: 

Theorem 4.2 (Micchelli, 1986) Let h be a continuous function on [0, oo) and positive 
on (0,oo). Suppose its first derivative is completely monotonic but not constant on (0,oo). 
Then for any distinct vectors Xi, ...,Xjv £ R n 

(-lf^det /fc(||x,- — XjH 2 ) > . 

The essence of this theorem is that if the first derivative of a function is completely monotonic 
this function can be used as radial basis function, since the matrix H associated to it can 
be inverted. A new class of functions is then allowed to be used as radial basis functions. 
For instance the function (c 2 + r) a , with < a < 1 and c possibly zero, is not completely 
monotonic, but satisfies the conditions of theorem (4.2), so that the choice (c 2 + r 2 ) a is 
possible for the function h in (4). 

A list of functions that can be used in practice for data interpolation is given below, and 
their use is justified by the results of Schoenberg or Micchelli: 

h(r) — e~^~) (gaussian) 

W= (c 2 + r 2 )* " >0 

h(r) = (c 2 + r 2 f </3< 1 

h(r) — v r 2 + c 2 (multiquadric) 

h(r) — r (linear) 

Notice that the linear case corresponds, in one dimension, to piecewise linear interpolation, 
that is the simplest case of spline interpolation (further and stronger connections to spline 
interpolation will be discussed later). Notice that even the case f3 = | has been explic- 
itly mentioned, since it corresponds to interpolation by means of "multiquadric" surfaces. 
Multiquadrics have been introduced by Hardy (1971) and extensively used in surface inter- 
polation with very good results (Franke, 1982; Rippa, 1984). Some of the functions listed 
above have been used in practice. Gaussian RBF have been used by Agterberg (1974) and 
Schagen (1979), and an approximating RBF expansion has been studied by Klopfenstein and 
Sverdlove (1983). The latter considered the case of equally spaced data and succeeded in 
giving some error estimates. RBF have even been used by Broomhead and Lowe (1988) and 
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Casdagli (1989) to predict the behavior of dynamical systems and by Renals and Rohwer 
(1989) for phoneme classification. A crude form of RBF performed better on the Nettalk 
task (Sejnowski and Rosenfeld, 1987) than Nettalk itself (Wolpert, 1988). 

Almost all of these functions share the unpleasant property of depending on a parameter, 
that will generally depend on the distribution of the data points. However it has been noticed 
(Franke, 1982) that the results obtained with Hardy's multiquadrics (in 2 dimensions) seem 
not to depend strongly on this parameter, and that the surfaces obtained are usually very 
smooth. It is interesting to notice that, in spite of the excellent results, no theoretical basis 
existed for Hardy's multiquadrics before Micchelli's theorem. On the contrary, in the case 
of several functions, including the gaussian, a mathematical justification can be given in the 
context of regularization theory, as we shall see in section (6). 

4.1.2 RBF and Conditionally Positive Definite Functions 

The theorem of Schoenberg on the equivalence of Mi and Vo has been recently extended, 
to obtain an interesting characterization of Vk- In fact in 1986 Micchelli proved: 

Theorem 4.3 (Micchelli, 1986) h(r 2 ) £ Vk whenever h(r) is continuous on [0, oo) and 
( — l) k d p is completely monotonic on (0, oo) 

To our extents the practical implication of this theorem is the following: if the function h(r 2 ) 
is not positive definite we do not know if it can be used as radial basis function, since the 
matrix H could be singular, but if the k-th derivative of h(r) is completely monotonic a 
polynomial of degree at most k — 1 can be added to the expansion (4), (see equation (3)), so 
that it can be used to solve the interpolation problem. Notice that, according to a remark 
of Micchelli (1986), the converse of theorem (4.3) holds true: denoting by Mk the functions 
whose k-th. derivative belongs to yW, f(r 2 ) £ Vk if and only if f(r) £ A4k, so that Vk = Mk- 
It has been noticed (Micchelli, 1986; Powell, 1988) that this theorem embeds the results 
obtained by Duchon (1976, 1977) and Meinguet (1979, 1979a) in their variational approach 
to splines. For instance the functions h(r) — r? and g(r) — |rlog y^ are not completely 
monotonic, but this property holds for their second derivatives (that is they belongs to A^)- 
By theorem 4.3 the functions h(r 2 ) — r 3 and g(r 2 ) — r 2 log r ("thin plate splines") belong 
to V<i)\ It is then possible to interpolate any set of data points using h(r 2 ) and g(r 2 ) as 
radial basis functions and a linear term (polynomial of degree at most 2-1) added to the 
expansion (4). This corresponds exactly to the expressions derived by Duchon and Meinguet, 
but without some of their limitations (see Example 2 in section 5.1.2). Since this method 
has been shown to embody natural spline interpolation in one dimension (Powell, 1988), can 
then be considered as an extension of natural splines to multivariable interpolation. 

4.2 RBF and Approximation Theory 

It is natural to ask if the expansion (3) is a good approximation method. The interpolation 
property of the RBF expansion, ensured by Micchelli's theorems, is neither sufficient nor 
necessary to guarantee good results. In particular, a very important question that is always 
addressed in approximation theory is whether the solution can be prevented from badly 
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oscillating between the interpolation points when they become dense. This does not happen 
when spline functions are used, due to the smoothness constraint, but it could happen when 
other radial basis functions are chosen. 

Several results have been obtained about this question in recent years. Perhaps the most 
important of them has been obtained by Jackson (1988). He addressed a more fundamental 
and general question: given a multivariate function /(x) and a set of TV data points {x^|z = 
1, ..., TV}, do there exist a sequence of functions F/v(x), with 



N k 



i=l j=l 



and some bounded open domain on which 

\F N (x) - /(x)| -► as N -► oc ? 

Jackson gave sufficient conditions on h for this result to hold. In particular he considered 
the function h(r) — r and showed that these conditions are satisfied in R 2n+1 but not in R 2n . 
These results are encouraging and make this approach a highly promising way of dealing 
with irregular sets of data in multi-dimensional spaces. 

Finally we mention the problem of noisy data. Since in this case a strict interpolation is 
meaningless, the RBF method must be modified. The problem consists in solving the linear 
system He = y in a way that is robust against noise. A very simple solution to this problem 
has been provided in the context of regularization theory (Tikhonov and Arsenin, 1977). 
It consists in replacing the matrix H by H + A/, where / is the identity matrix, and A is 
a "small" parameter, whose magnitude is proportional to the amount of noise in the data 
points. The coefficients of the RBF expansion are then given by 

c = (H + \I)- 1 y. (6) 

This equation gives an approximating RBF expansion, and the original interpolating expan- 
sion is recovered by letting A go to zero. Since data are usually noisy, from now on we will 
refer to the RBF expansion as the one computed by means of equation (6). 

4.3 Network Implementation 

A remarkable property of this technique is that it can be implemented by a simple network 
with just one layer of hidden units, as shown in figure 2 (see Broomhead and Lowe, 1988). 
For simplicity we restrict ourselves to expansions of the type (4), disregarding the polynomial 
term, the results being valid even in the more general case in which it is included. 
The first layer of the network consists of "input" units whose number is equivalent to the 
number of independent variables of the problem. The second layer is composed by nonlinear 
"hidden" units fully connected to the first layer. There is one unit for each data point 
X; = (xi,yi,Zi,' • •) parametrized by its "center", which has the coordinates (x^, y^ Z{, • • •) 
of the data point itself. The output layer, fully connected to the hidden layer, consists 
of one (or more) linear unit(s), whose "weights" are the unknown coefficients of the RBF 
expansion. These output units may also represent a fixed, nonlinear, invertible function, as 
already observed by Broomhead and Lowe (1988) and discussed in a later section. This is 
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a. 



b. h (llx-"t n ll) 



Figure 2: a) The Radial Basis Function network for the interpolation of a bivariate function 
F. The radial hidden units h evaluate the functions h(\\^ — t n \\). A fixed invertible nonlinear 
function may be present after the final summation, b) A radial hidden unit h. The input 
is given by x = (x^y) and its parameters are the coordinates of the n-th "center" t n . The 
output is the value of the radial basis function h, centered on t n) at point x. The centers t n 
coincide in this RBF case with the data points x n 
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useful for instance in the case of classification tasks. All the learning algorithms discussed 
later extend trivially to this case. 

Notice that the architecture of the network is completely determined by the learning 
problem, and that, unlike most of the current "neural" networks, there are no unknown 
weights connecting the input and the hidden layer. Since spline interpolation can be imple- 
mented by such a network, and splines are known to have a large power of approximation, 
this shows that a high degree of approximation can be obtained by just one hidden layer 
network. 

Of course this method has its drawbacks. The main one is that the method is global, 
that is each data point contributes to the value of the interpolating function at every other 
point. The computation of the coefficients of the RBF expansion can become then a very 
time consuming operation: its complexity grows polynomially with TV, (roughly as TV 3 ) since 
an TV X TV matrix has to be inverted. In a typical application to surface reconstruction from 
sparse stereo data the number of data points can easily be more than 3000: to invert a 
sparse 3000 X 3000 matrix not only is a formidable task, but may not be meaningful, since 
we know that the probability of ill-conditioning is higher for larger and larger matrices (it 
grows like TV 3 for aA^xiV uniformly distributed random matrix) (Demmel, 1987). Another 
problem with the Radial Basis Function method, and with interpolating methods in general, 
is that data are usually noisy and therefore not suitable for interpolation; an approximation 
of the data would be preferable. In Section 5 a solution to these problems will be given 
in the context of regularization theory. The next section presents a method that has been 
proposed by Broomhead and Lowe (1988) to reduce computational complexity and gives as 
result an approximation instead of an interpolation. A new result is given, supporting its 
validity. 

4.4 Approximated Radial Basis Function 

In this section we show how the Radial Basis Functions can also be used for approximation 
rather than for interpolation. In the RBF approach the basis on which the interpolating 
function is expanded is given by a set of radial functions h translated and centered on the 
data points. The interpolating function is then a point in a multidimensional space, whose 
dimension is equal to the number of data points, which could be very large. As usual when 
dealing with spaces of a such high dimensionality we could ask if all the dimensions are really 
significant. 

This suggests that the RBF expansion could be approximated by an expansion in a 
basis with a smaller number of dimensions (Broomhead and Lowe, 1988). This can be 
accomplished by an expansion of the following form: 

F(x) = |:c„M||x-t„||) (7) 

a = l 

where the t a are K points, that we call "centers" or "knots", whose coordinates have to be 
chosen and K < N. It is clear from equation (7) that the interpolation conditions can no 
longer be satisfied. Imposing i^(x^) = ]ji leads to the following linear system: 



17 



K 

This system is overconstrained, being composed of A^ equations for K unknowns, and the 
problem must be then regularized. A least-squares approach can be adopted (see also Broom- 
head and Lowe, 1988) and the optimal solution can be written as 

c = H + y (8) 

where (H)i a — h(\\^i — t a \\) and H + is the Moore-Penrose pseudoinverse of H (Penrose, 1955; 
Ben-Israel and Greville, 1974). The matrix H is rectangular (N X A ) and its pseudoinverse 
can be computed as 

H + = (H T H)- 1 H T 

provided (H T H) -1 exists. The matrix H T H is square and its dimensionality is A , so that it 
can be inverted in time proportional to A 3 . A rough estimate suggests that this technique 
could speed the computations by a factor (]^) 3 - Of course, other methods for computing the 
pseudoinverse exist, including recursive ones (see Albert, 1972). 

As in the previous case this formulation makes sense if the matrix H T H is nonsingular. 
Micchelli's theorem is still relevant to this problem, since we prove the following corollary: 

Corollary 4.4.1 Let G be a function satisfying the conditions of MicchellVs theorem and 
Xi,...,Xjv an N -tuple of vectors in R n . If H is the (N — s) x A^ matrix H obtained from the 
matrix Gi^ 3 — G(||x ? - — Xj||) deleting s arbitrary rows, then the (N — s) X (N — s) matrix 
H H is not singular. 

To prove this corollary it is sufficient to notice that, since Micchelli's theorem holds, the 
rank of Gi 3 is N. A theorem of linear algebra states that deleting s rows from a matrix 
of rank A^ yields a matrix of rank N — s. Remembering that rank(AA T ) = rank (A) for 
every rectangular matrix A we have that rank(HH T ) — rank(H) — N — s] then HH T is not 
singular. 

This result asserts that if the set of knots is chosen to be a subset of the set of data, and 
if the conditions of Micchelli's theorem are satisfied, the pseudoinverse can be computed as 
H + = (H T H)~ 1 H T . Other choices for the set of knots seem possible in practice: for example 
Broomhead and Lowe (1988) use uniformly spaced knots to predict successfully chaotic time 
series, in conjunction with a gaussian Radial Basis Function. 

We remark that while this technique is useful for dealing with large sets of data, it has 
been proposed as a way of dealing with noise. In the next section, we will show, however, 
that if the only problem is noise, an approximating technique of radial basis functions with as 
many centers as data points can be derived in a rigorous way with the aid of regularization 
theory. An extension of the RBF method will be presented in the general framework of 
regularization theory. 
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5 Regularization Approach and Generalized Radial 
Basis Functions 

In this section we derive from regularization theory an alternative approximation method 
based on a basis of radial functions. We apply regularization theory to the approxima- 
tion/interpolation problem and we show that for a large class of stabilizers the regularized 
solution is an expansion of the radial basis function type. The approach leads to a represen- 
tation, that we call Generalized Radial Basis Functions (GRBFs), which is very similar to 
RBFs while overcoming the problem of the computational complexity of RBFs for large data 
sets. The GRBF technique can then be considered the point of contact between multilayer 
feedforward networks and the mathematical apparatus of standard regularization theory. 

5.1 Regularization Theory 

Let S = {(x;,j/;) £ R n x R\i = 1,...7V} be a set of data that we want to approximate by 
means of a function /. The regularization approach (Tikhonov, 1963; Tikhonov and Arsenin, 
1977; Morozov, 1984; Bertero, 1986) consists in looking for the function / that minimizes 
the functional 

N 

#[/] = E(y.--/(x.-)) 2 + A||i7ll 2 

i=l 

where P is a constraint operator (usually a differential operator), || • || is a norm on the 
function space to whom / belongs (usually the L 2 norm) and A is a positive real number, 
the so called regularization parameter. The structure of the operator P embodies the a 
priori knowledge about the solution, and therefore depends on the nature of the particular 
problem that has to be solved. Minimization of the functional H leads to the associated 
Euler-Lagrange equations. For a functional H[f] that can be written as 

-" [J I / (IXay ...L,(J ^ J^, J y , ..., Jxx? Jxy, Jyy-) •••Jyy...y) 

the Euler-Lagrange equations are the following (Courant and Hilbert, 1962) 

d d d 2 d 2 

d 2 d n 

+ d^ Cf - + ' ' ' + ( " 1)n W Ay - y + ' ' ' = ° " 

The functional H of the regularization approach is such that they can always be written as 

1 N 

i>i>/(x) = -ny.- -/(*))*(*-*) ( 9 ) 

i=l 

where P is the adjoint of the differential operator P and the right side comes from the 
functional derivative with respect to / of the data term of H. 
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Equation (9) is a partial differential equation, and it is well known that its solution can 
be written as the integral trasformation of its right side with a kernel given by the Green's 
function of the differential operator PP, that is the function G satisfying the following 
distributional differential equation: 

PPG(x;0 = «(x-0- 

Because of the delta functions appearing in equation (9) the integral transformation becomes 
a discrete sum and / can then be written as 

1 N 
/(x) = -py,--/(xO)G(x;x,-). (10) 

i=l 

It is important to notice that a polynomial term should in general be added to the right- 
hand side of equation (10), depending on the specific stabilizer. Equation (10) says that the 
solution of the regularization problem lives in an TV-dimensional subspace of the space of 
smooth functions. A basis for this subspace is given by the TV functions G(x;x ? ), that is by 
the Green's function G "centered" on the data points x^. 

A set of equations for the unknown coefficients q = m ~^ x ^ is easily obtained by evaluat- 
ing equation (10) at the A^ data points x^. A straightforward calculation yields the following 
linear system: 

(G + A/)c = y (11) 

where / is the identity matrix, and we have defined 

(y)t = Vi ? ( c )^ = c." ? (G)ij = G^XijXj) . 

We then conclude that the solution to the regularization problem is given by the following 
formula 

N 

/(x) = 5>G(x;x,-) (12) 

i=l 

where the coefficients satisfy the linear system (11) and a polynomial term is in general 
added. 

Notice that since ||P/|| 2 is quadratic the corresponding operator in equation (9) is self- 
adjoint and can be written as PP. Then the Green's function is symmetric: G(x; £) = 
Cr(£; x). If ||P/|| 2 is translationally invariant G will depend on the difference of its arguments 
(G = G(x — £)) and if ||P/|| 2 is rotationally and translationally invariant G will be a radial 
function: G = <3(||x - £||)*. 

Let us now compare equations (12) and (11) with equations (4) and (6). It is clear 
that if the stabilizers P is rotationally invariant then the regularized solution is given by 
an expansion in radial functions. The requirement of rotational and translational invariance 
on ||P/|| 2 is very common in practical applications. Clearly regularization with a non- 
radial stabilizer P justifies the use of appropriate non-radial basis functions, retaining all 
the approximation properties associated with the Tikhonov technique. Examples of P and 
corresponding G will be given in section 5.1.2. 
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5.1.1 Micchelli's condition and regularization define almost identical classes of 
radial basis functions 

At this point it is natural to ask about the relation between the class of functions defined 
by stabilizers P of the Tikhonov type in regularization theory (the class T) and the class 
Vk (for all k) of conditionally positive definite functions of order h (which is identical with 
Mk)- The two classes have to be very closely related, since they originate from the same 
problem - optimal approximation. The regularization approach gives - if the stabilizer is 
radially symmetric - radial functions G that satisfy 

PPG{\\x-Z\\) = 6{*-Z). (13) 

Notice that if P contains a term proportional to the function itself, then PP contains a 
constant term; by taking the Fourier transform of equation (13) and applying Bochner's 
theorem (1932, 1959) on the representation of positive definite functions (Stewart, 1976), it 
turns out that G is positive definite, that is G £ Vq. (See section 5.1.2 for details). In general 
equation (13) implies (Gelfand and Vilenkin, 1964) that G is conditionally positive definite (of 
an order determined by P). This discussion suggests that Vk = Mk 3 T (for all k): in fact a 
function G may satisfy the conditions of Micchelli's theorem 4.3 (G £ -M^), and therefore be 
conditionally positive definite of order k (G £ Vk)-> without satisfying equation (13) for any 
operator P (G (jt T). We conjecture that these functions G, G £ Vk, G (fi T are interpolating 
functions but not good approximating functions, since they do not come from a regularization 
approach. We have little reasons for this conjecture, apart from Jackson's result (1988): 
in an even number of dimensions the function h(r) — r does not satisfy his (sufficient!) 
conditions for good approximation and is not the Green's function of any known Tikhonov 
stabilizer, though it can be used as radial basis function, according Micchelli's theorem 
(4.2), since h(y/r) £ M.\. Notice that in i? 2n+1 , h(r) — r satisfies Jackson's conditions, is 
the Green's function of the thin-plate stabilizer and satisfies Micchelli's conditions. Hardy's 
multiquadrics H(r) — (c 2 -\- r 2 )^ are a possible counterexample to our conjecture, since they 
are conditionally positive definite of order one (H £ Pi), numerical work suggests that they 
have good approximation properties and we have been so far unable to obtain an expansion 
in terms of multiquadric from any regularization principle. We are happy to leave the answer 
to these questions to real mathematicians. 

5.1.2 Examples 

Example 1 

We now consider a wide class of stabilizers and show that they lead to a solution of the 
regularization problem that has the form of a radial basis function expansion of the type of 
equation (4), with a positive definite basis function and without the polynomial term. 
Let us consider the class of constraint operators defined by 



CO 



PJW = dx V a m (P m f(x)Y (14) 



n m=0 
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where P 2m = V 2m , p 2m +! = VV 2m , V 2 is the Laplacian operator and the coefficients a m 
are real positive numbers. The stabilizer is then translationally invariant and the Green's 
function satisfies the distributional differential equation: 



£(-ira m V 2m G(x-0=«5(x-0 . (15) 

m=0 

By Fourier transforming both sides of equation (15) we obtain: 

CO 

J2 a m (u ■ u) m G(u) = 1 (16) 

m=0 

and by Fourier anti-transforming G(lo) we have for the Green's function G(x): 



p ZLU-X 



G(x) = L ^M^r = L ^"^M (") 

where V(lo) is a bounded non-decreasing function if a ^ 0. Now we can apply Bochner's 
theorem (1932), which states that a function is positive definite if and only if it can be written 
in the form (17), to conclude that G(x) is positive definite. Notice that the condition a ^ 
is crucial in this particular derivation, and, as it has been pointed out by Yuille and Grzywacz 
(1988), it is a necessary and sufficient condition for the Green's function to fall asymptotically 
to zero. Let us now see some examples. 

One example is provided by the following choice of the coefficients: 

a = 1, a\ — 1, a n — \/n > 2 . 

In this case the Green's function (here in one dimension) becomes the Fourier transform of 
., ^ o . and then 

G(x) oc e W . 

Clearly this function is not very smooth, reflecting the fact that the stabilizer consists of 
derivatives of order and 1 only. Smoother functions can be obtained allowing a larger 
(possibly infinite) number of coefficients to be different from zero. For instance, setting 

1 



(2*0 



and remembering that X^ =0 t^tt = cosh(c<;) we obtain G(x) — ^/ \ , which is a very 
smooth, bell-shaped function. 

Another interesting choice (Yuille and Grzywacz, 1988) is: 



a 2m 



m\2 m 
which gives as Green's function a multidimensional Gaussian of variance a . The regular- 
ized solution is then a linear superposition of Gaussians centered on the data points x^. 
Its physical interpretation is simple: regarding a as "time" the solution satisfies the heat 
equation: 
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0(7 

with boundary conditions /(x;,cr) = %}{. The regularized solution for A = can then be 
regarded as the pattern of temperature of a conducting bar which is in contact, at the points 
X;, with infinite heat sources at temperature %}{. The value of a is then related to the diffusion 
time. 

Example 2 

A widely used class of stabilizers is given by the functionals considered by Duchon (1976, 
1977) and Meinguet (1979, 1979a) in their variational approach to multivariate interpolation, 
that is one of the possible generalizations of spline theory from one to many dimensions. In 
particular they considered rotationally invariant functionals of the form 



n * 



xlll 2 



l\ ...l m 



where <9 ? \ ? - = -, - t and 1 < m. The solution to this variational problem is of the form 

N k 

F(x) = J2 ^ m (||x - x,-||) + J2 d M*) (is) 

i=l i=l 

which is exactly the same as equation (3). Here misa measure of the degree of smoothness of 
the solution, and is related to the maximum polynomial precision that can be obtained by the 
relation: k < m. Since the stabilizers are rotationally invariant, the corresponding Green's 
functions h m are radial, and for each fixed value of m they turn out to be h m (r) — r 2m ~ n In r 
if n < 2m for n even, and h m (r) — r 2m ~ n otherwise. 

As an example we show the case n — m — 2. The functional to be minimized is 



HAf] = / dxdy 
Jr 2 




and the Green's function h is the well known "thin plate spline" h(r) — r 2 lnr. In this 
case a linear term appears as the second term of the right hand side of equation (18). Thin 
plate splines have been introduced by engineers for aeroelastic calculations (Harder and 
Desmareis, 1972), their name coming from the fact that H 2 is the bending energy of a thin 
plate of infinite extent. The results obtained with thin plate splines are comparable with 
those obtained with Hardy's multiquadrics. Another radial basis function that has been 
extensively used is r 3 , with a linear term in the expansion as well, wich is equivalent to cubic 
splines in the one dimensional case. 

5.2 Extending to Movable Centers: GRBF 

In the previous section we have shown that the function minimizing the functional H is spec- 
ified by A^ coefficients, where A^ is the number of data points. Therefore, when A^ becomes 



23 



large, the regularization approach suffers from the same drawbacks of the RBF method that 
have been pointed out in a previous section. To reduce the computational complexity of the 
representation, we then write an approximate solution /* of the regularization problem as 
an expansion involving a fewer number of centers, as done by Broomhead and Lowe (1988), 
that do not necessarily coincide with some of the data points x^, that is: 

n 

.f(x) = $>aG(x;t a ) (19) 

a = l 

where the coefficients c a and the centers t a are unknown. We now have to face the problem 
of finding the n coefficients c a and the d-n coordinates of the centers t a so that the expansion 

(19) is optimal. In this case we dispose of a natural definition of optimality, given by the 
functional H. We then impose the condition that the set {c a ,t a |a = 1, ...,n} must be such 
that it minimizes H[f*], and the following equations must be satisfied: 

dH[f] n dH[f] n 1 , 9m 

^-- = 0, -^ = 0, a=l,...,n. (20) 

We call an expansion of the type of equation (19) with the coefficients satisfying equation 

(20) a Generalized Radial Basis Function (GRBF) expansion. 

The explicit form of equation (20) depends on the specific constraint operator that has 
been used. We perform here the computations for the constraint operator ||Pi/*|| 2 considered 
in the Example 1 of the previous section, with boundary conditions such the function / and 
all its derivatives vanish on the border of the integration domain. The main difficulty is 
to find the explicit expression of ||Pi/|| 2 as a function of c a and t a . To accomplish this 
task we notice that, by using Green's formulas (Courant and Hilbert, 1962), which are the 
multidimensional analogue of integration by parts, and using our boundary conditions, the 
m-th term of Pi can be written as 

/ Jx(P"7(x)) 2 = (-l) m / ix/(x)P 2m /(x) . (21) 

JR n JR n 

Substituting equation (21) in P 1; and using definition of the differential operator PP ', we 
obtain 



\\PJW 2 = / dx/(x)i\i\ /(x) . (22) 

JR n 

When /* is substituted in equation (22) each term containing G(x) gives a delta function 
and the integral disappears, yielding: 

n 

II A/*|| 2 = E c a cpG{t a ;tp) . 

Defining a rectangular N X n matrix G as (G)i a = G(x ? -;t a ) and a symmetric n X n square 
matrix as (g) a (3 — G(t a ; t^), H[f*] can finally be written in the following simple form: 

H[f*] = c(G T G + \g)c - 2cG T y + y y . (23) 
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Notice that equation (23) is a quadratic form in the coefficients c a , so that minimization 
with respect to them is easily done. For each fixed set of centers t a the optimal vector c is 
then given by 

c = (G T G + X g y 1 G T y . (24) 

Notice that if we let A go to zero and G is radial (and centers are fixed) the approximate 
method of Broomhead and Lowe is recovered. Their method, however, lacks the capability 
of looking for the optimal set of knots of the expansion, as required by regularization theory. 
The possibility of moving the knots, by a procedure of the gradient descent type, could 
noticeably improve the quality of approximation. It has been mentioned by Broomhead and 
Lowe (1988), and has been used by Moody and Darken (1989) (their method is a heuristic 
version of RBF with moving centers, see section 6.4). Notice that from the point of view of 
approximation theory this is a nonlinear problem that reminds us of the splines with free 
knots, that are splines whose knots are allowed to vary with the function being approximated 
(De Vore and Popov, 1987; Braess, 1986). 

Clearly this approximated solution does not satisfy equation (10) anymore. However an 
explicit computation shows that equation (10) is satisfied at the centers, that is 

r(t a ) = f; y '~^ (x, ' ) G(t a ;x,-). (25) 

i=l 

The converse is also true: if one fixes the set of knots and requires that equation (25) hold 
for each a, equation (24) is easily recovered. 

5.3 GRBF and Gradient Descent 

In the previous section we introduced the GRBF method to approximate the regularized so- 
lution. It requires the minimization of the multivariate function H[f*], which is not convex 
in general. Gradient-descent is probably the simplest approach for attempting to find the 
solution to this problem, though, of course, it is not guaranteed to converge. Several other 
iterative methods, such as versions of conjugate gradient and simulated annealing (Kirk- 
patrick et al., 1983) may be better than gradient descent and should be used in practice: in 
the following we will consider for simplicity (stochastic) gradient descent. It is straightfor- 
ward to extend our equations to other methods. In the gradient descent method the values 
of c a and t a that minimize H[f*] are regarded as the coordinates of the stable fixed point 
of the following dynamical system: 

dH[f] , K 

c n — —to — , a — 1, ...A 



dc n 



l ? 



dH[f] , K 

-to — , a = L....K 



dt f 



-5 ' 



where a; is a parameter determining the microscopic timescale of the problem and is related 
to the rate of convergence to the fixed point. The gradient descent updating rules are given 
by the discretization (in time) of the previous equations. Clearly, since the function H[f*] 
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is not convex, more than one fixed point could exist, corresponding to local, suboptimal 
minima. To overcome this problem one could use "stochastic" gradient descent by adding 
a random term to the gradient descent equations. They become then stochastic equations 
of the Langevin type, often used to model the relaxation of a physical system toward the 
equilibrium in the presence of noise (Wax, 1954; Ma, 1976; Parisi, 1988). In this case the 
learning process is governed by the following stochastic equations 

c a =-u dH }^ +r) a (t), a=l,...K (26) 



d 



i a = -u^-^ + e a (t), a = 1, ...K (27) 

dt a 

where rj a and e a are white noise of zero mean and variance 

<r, a {t)r,p{t')> = < e a {t)e p {t') > = 2T8 af} 8(t -t') 

where T measures the power of the noise, analog to temperature. Solving these equations 
is similar to using Montecarlo methods of the Metropolis type (Metropolis at al., 1953) 
(decreasing the variance of the noise term during the relaxation is similar to performing 
stochastic annealing). 

We now consider for simplicity the case in which the Green's function is radial (G(x; t) = 
/^(||x — t|| 2 )) and A is set to zero. Defining the interpolation error as 

A; = yi - y* , 

where ?/* = /*(x;) is the response of the network to the z-th example, we can write the 
gradient terms as 

BH N 

— = -2^A^(||x t --t a || 2 ) , (28) 

dH N 

— = Ac a J2 A t -/>'(||x t - - t a || 2 )(x t - - t a ) (29) 

^a i=l 

where h f is the first derivatives of h. Equating -kz- to zero we notice that at the fixed point 
the knot vectors t a satisfy the following set of nonlinear equations: 

t a — — — a — 1. .... A 

" v^, pa ' ' 

where Pf — A;/?/(||x; — t a || 2 ). The optimal knots are then a weighted sum of the data points. 
The weight Pf of the data point % for a given knot a is high if the interpolation error A; is 
high there and the radial basis function centered on that knot changes quickly in a neighbor 
of the data point. This observation could suggest faster methods for finding a quasi-optimal 
set of knots. Notice that if the radial basis function h depends on a parameter e, that is if 
h = h{r, e), the functional H[f*] must be minimized even with respect to this parameter. 
The following equation must then be added to equations (28) and (29): 
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BM B 

— = -2J2 A,c a — h(\\xn - t a \\ 2 , e). (30) 

Be . Be 

As we mentioned before, an invertible nonlinearity, such a sigmoid, may be present at the 
output of the network. In this case 

N 

r(x) = a(^c„M||x-t„|| 2 )). (31) 

Oi = l 

holds and equations (28) and (29) are modified in the following way: 

BM N 

= -2j2^(y!)Xh(\\^-t a \\ 2 ) , (32) 

BH N 

— = Ac a J2 <T'(yt)Aih'(\\xi - t a || 2 )(x, - t a ) (33) 

°^a i=l 

In the gradient descent equations nothing forbids that two or more centers may move 
towards each other until they coincide. Clearly, this should be avoided (it corresponds 
to a degeneracy of the solution) in an efficient algorithm. A formal way to ensure that 
centers do never overlap is to add a term to the functional that it is minimized of the 
form J2a^p ^(\\^a — t/?||), where \& is an appropriate repulsive potential, such as ^(y) = -\. 
Equations (28) and (29) can be easily modified to reflect this additional term (see Girosi and 
Poggio, 1989a). In practice, it may be sufficient to have a criterion that forbids to any two 
centers to move too close to each other. 

In terms of networks the GRBF method has the same implementation of RBF. The 
only difference is that the parameters of the hidden layer are now allowed to vary, being 
the coordinates of the knots of the GRBF expansion. From this point of view the GRBF 
network is similar to backpropagation in the sense that there are two layers of weights to 
be modified by a gradient descent method, and there are in principle local minima, figure 3 
shows the network that should be used in practice for finding the parameters. Notice that a 
constant, a linear term (and possibly higher order polynomials) should be added to the radial 
basis representation (depending on the stabilizer, polynomials may be in the null space of 
the regularized solution, as for thin-plate splines but not for gaussian radial basis functions). 
Equations (28) and (29) should be modified appropriately. Figure 3 makes this clear. 

5.3.1 A semiheuristic algorithm 

The previous discussion and some of the analogies discussed later (with the k-means algo- 
rithm and with Moody's approach) suggest the following heuristic algorithm for GRBFs. 

1. set the number of centers and use the k-means algorithm or a comparable one (or 
even equation 29 with A; = constant) to find the initial positions t a of the centers. 
Alternatively, and more simply, set the initial value of the centers to a subset of the 
examples 
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Figure 3: The GRBF network used to approximate a mapping between Xi,x 2 ,...,x n and 
?/, given a set of sparse, noisy data. In addition to the linear combination of radial basis 
functions, the network shows other terms that contribute to the output: constant and lin- 
ear terms are shown here as direct connections from the input to the output with weights 
ao, &1, a2, <i n . Constant, linear and even higher order polynomials may be needed, depend- 
ing on the stabilizer P. Gaussian radial basis functions, on the other hand, may not need 
additional terms. 
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2. use the pseudoinverse technique to find the values of the coefficients c a (see equation 
24) 

3. use the t a and c a found so far as initial values for equations (29) and (28) 

4. explore how performance changes by changing incrementally the number of centers. 

6 Relations with other methods 

We have been impressed by the generality of this formulation and by how many existing 
schemes can be understood within this framework. In this section, we will mention briefly 
some of the most obvious connections with existing methods. In the next sections we will 
discuss possible extensions of the method, its relation to a specific style of computation that 
has biological undertones, its meaning from a Bayes point of view, and finally, some general 
points about the most crucial problem of learning, the "curse of dimensionality". 

6.1 GRBF and Classification Tasks 

RBF and GRBF are an interpolation and approximation method for continuous, in fact 
smooth, functions, as shown by the fact that they are generalized splines. It is quite natural 
to ask whether the method can be modified to deal with piecewise constant functions, i.e., 
with classification tasks, and in particular boolean functions. More precisely, we have so far 
considered GRBFs as a technique to solve the problem of approximating real valued functions 
/ : R n -^ R m ] we ask now whether they can be specialized to deal with the problem of 
approximating functions h : R n -^ {0, 1} (the classification problem) and t : {0, l} n -^ {0, 1} 
(the boolean learning problem). 

Computer experiments show that without any modification, Gaussian radial basis func- 
tions can be used to learn successfully XOR (Broomhead and Lowe, 1988). A simple form of 
RBF was shown to perform well on the classification task of NetTalk (Wolpert, 1988). We 
expect therefore that classification tasks could be performed by the GRBF method described 
earlier using a basis of smooth functions. In a similar way, backpropagation networks with 
smooth sigmoid nonlinearities have been used in several classification tasks. Thus, it seems 
that the method can be used, as it stands, for classification tasks and for learning boolean 
functions. The question remains, however, whether a special basis of radial functions could 
be used advantageously. Consider the task of learning boolean functions. In backpropaga- 
tion networks, the boolean limit corresponds to the smooth sigmoid nonlinearities becoming 
linear threshold functions. The obvious boolean limit of radial basis functions is a basis of 
hyperspheres: 

S a ,d{%) — 1 for \\x — x a \\ < d else — (34) 

where S a ,d is the hypersphere of radius d centered in x a ^ x, d are boolean vectors and || • || is 
the Hamming distance. The use of such a basis may be similar to closest match classification 
(which can also be used for continuous mappings by using Euclidean metric instead of Ham- 
ming distance). Of course, this basis does not satisfy Micchelli's condition, and cannot be 
derived from regularization. It may be tempting to conjecture that arbitrarily close smooth 
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approximations exist that do. The obvious case of a difference of sigmoids, however, does 
not satisfy Micchelli's condition, since its first derivative is not completely monotonic: 

5(x) = l ' 



1 + ae x 1 + ae x (1 + a 2 ) + 2a cosh(x) 

In the boolean limit of backpropagation, one knows that a network with one hidden layer 
can represent any boolean function given a sufficient number of hidden units (because it is 
well known that any boolean function can be written as a threshold of linear combinations 
of threshold functions). The representation may require in general a large number of hidden 
units because it amounts to a disjunctive normal form of the boolean function. In a similar 
way, a basis of hyperspheres can be used to represent any boolean function by having a 
sufficient number of "centers," one for each term in the disjunctive normal form of the 
function. 

Seen in a more geometrical way (consider the case of a binary classification problem on i?), 
the boolean limit of RBFs carves the n-dimensional input space into hyperspheres, whereas 
the linear threshold limit for BP carves the space into regions bounded by hyperplanes. It 
seems clear that each of the partitions can be made to approximate the other arbitrarily 
well, given a sufficient number of hidden units and/or centers. 

6.2 GRBF and Backpropagation 

GRBF are similar, but not identical, to backpropagation networks with one hidden layer, 
since: a) they also have one hidden layer of smooth differentiable functions; b) they are 
updated by a gradient descent method (as backpropagation) that operates on two "layers" 
of parameters, the q and the t a ; c) the update moves the "centers" of those blurred hyper- 
spheres (in the Gaussian case) and their weight in the final layer, whereas in backpropagation 
"blurred" hyperplanes are shifted during learning. 

From the point of view of approximation and regularization theory GRBFs have solid 
theoretical foundations, a property that is not (yet) shared by backpropagation. However, 
some results on the approximating power of backpropagation networks have been recently 
obtained (Carrol and Dickinson, 1989; Cybenko, 1989; Funahashi 1989; Arai, 1989). Their 
essence is that a network with one hidden layer of sigmoid units can synthesize arbitrary 
well any continuous function, but may require a very large number of hidden units. Among 
other advantages relative to backpropagation networks, the simplest version of our scheme - a 
radial basis function network with centers fixed and centered at the examples - is guaranteed 
to perform quite well, to have an efficient implementation (there are no local minima in 
the error functional) and to be equivalent to a powerful approximation technique, that is 
generalized splines. Interestingly, this network may represent the initial step in a gradient 
descent procedure for synthesizing a more powerful GRBF network (compare section 5.3.1). 

6.3 GRBF and Vector Quantization 

The classification limit of GRBF (with the basis of equation (34) and || • || the euclidean 
distance, say) is clearly related to vector quantization (Gersho, 1982). Vector quantization 
of a signal vector / involves subdivision of the n-dimensional vector space into J decision 
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regions D 3 , each enclosing one of the J reconstruction values. The signal vector / is quantized 
to the reconstruction vector r 3 , if / lies in the decision region D 3 . In terms of equation 34, 
this mean that S 3 (f) — 1, and the domains in which the Si are nonzero are disjoint. 

6.4 GRBF and Kohonen's Algorithm 

Kohonen (1982) suggested an algorithm to establish a topology conserving and dimension- 
ality reduction map from a set of inputs in a high dimensional space. The algorithm has 
been suggested in order to describe maps similar to those that form between cortical areas. 
It has also been used for several other tasks, such as learning motor movements (Ritter and 
Schulten, 1986, 1987). Kohonen's algorithm can be regarded as a special form of the k-means 
method (MacQueen, 1967) for finding the centers of n clusters in a set of inputs. It turns 
out that this is what the update equation in the t a does, i.e. 

BE N 

— = Ac a J2 A t -/>'(||x t - - M 2 )(x, - t a ) (35) 

°^a i=l 

with Ai = constant. The differences with respect to Kohonen's algorithm are that a) each 
center is affected by all data points and not only by the ones that "belong" to it b) h! 
depends on ||x; — t a || rather than on the distance between the "locations" % and a. Notice 
that the addition of a "repulsive" terms in the functional to avoid overlapping centers make 
the analogy with Kohonen's algorithm even closer. 

Intuitively, the last equation with A; = constant adjusts the t a to the centers of the 
cluster of the data. This analogy suggests a heuristic scheme to improve convergence of the 
gradient descent method: first find the centers of the clusters with an algorithm like equation 
(35) with Ai = constant, then use the full scheme (equations (28) and (29)). The heuristic 
should help the method to avoid local minima. 

Moody and Darken (1989) propose a similar heuristic as the core of their method; they 
first find the position of the "centers" (they do not use, however, strict Radial Basis Func- 
tions) with the k-means algorithm and then find the coefficients q of the expansion. The 
k-means algorithm and Kohonen's are also related to vector quantization (see above). Other 
so-called competitive learning algorithms are similar to Kohonen's algorithm. 

6.5 GRBF, Kanerva's Model and Marr's Model of the Cerebel- 
lum 

Kanerva introduced in 1984 a memory model called Sparse Distributed Memory (SDM). 
Keeler (1988) has shown that the SDM can be regarded as a three-layer network with one 
layer of hidden units (see figure 4). In this description, the first layer consists of n boolean 
units that represent input vectors (binary addresses) a. The hidden layer consists of m 
binary units s (the select vectors) with m » n. The weights between the input layer 
and the hidden layer are given by the matrix A, with rows that correspond to the storage 
locations. Each output unit (in SDM there are n output units) is also binary with connection 
weights to the hidden units given by the matrix C, which is updated during learning by a 
Hebbian learning rule. Thus given an input address a, the selected locations are: 
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s = S 0id (Aa), 

where Sq^ was defined in section 6.1 (we assume x = 0). C is set to be the sum of the outer 
products of desired outputs d and selected locations, i.e. C — 2jdjsJ. 

Clearly the analogy with the boolean limit of RBF is complete: the matrix A contains 
the locations of the centers (the selected locations) and C corresponds to the c of RBF. In 
the SDM, the center locations are fixed. The Hebbian one-step update rule can be regarded 
as a zero-order approximation of the gradient descent scheme, which is equivalent, for fixed 
centers, to the pseudoinverse (see Appendix B.2). 

Input layer, n units 




m hidden units 



Output layer, n units 



Figure 4: Kanerva's SDM represented as a three-layer network. The matrix A of connections 
between the first layer and the hidden layer contains the locations of the m centers. The ma- 
trix C of modifiable connections between the hidden layer and the output layer corresponds 
to the coefficients c in the RBF formulation (redrawn from Keeler, 1988). 

Keeler has discussed the similarities between the SDM model and the Hopfield model. 
He has also pointed out that Kanerva's SDM is very similar in mathematical form to a model 
of the cerebellum introduced by Marr (1969) and Albus (1971). In our language, the mossy 
fibers are the input lines, the granule cells correspond to the centers (the granule cells are 
the most populous neurons in the brain), and the Purkinje cells correspond to the output 
units (the summation). Other cells, according to Marr and Albus, are involved in what 
we would call control functions; the basket cells that receive another input may inhibit the 
Purkinje cells, whereas the stellate cells could change the gain of the Purkinje cells or their 
threshold. The Golgi cells receive inputs from the granule cells (the centers), and feedback 
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into the mossy fiber - granule cell connections. It may be interesting to consider in more 
detail whether the circuitry of the cerebellum may have anything to do with the more general 
continuous update scheme described by equations (28) and (29). 

Keeler also suggests modifying the SDM to optimize the addresses A{ according to A^ ew = 
Af d — X(Af d — x). This is about the same as Kohonen's algorithm and reduces (for d — 1) to 
the unary representation of Baum, Moody and Wilczek (1988). Keeler provides interesting 
estimates of the performance of SDM. In conclusion, Kanerva's algorithm can be regarded as 
a special case of the boolean limit of GRBF, which again provides a more general framework 
and a connection with continuous approximation. 

6.6 Other Methods 

The GRBF formulation seems to contain as special cases two well-known schemes in the field 
of pattern recognition. One is Parzen windows (Parzen, 1962), which is an approximation 
scheme typically used to estimate probability distributions, and which is remarkably similar 
to a simple form of RBF (Duda and Hart, 1973). The other scheme is the method of 
potential functions for determining discriminant functions (Duda and Hart, 1973). The 
method was originally suggested by the idea that if the samples were thought of as points in a 
multidimensional space and if electrical charges were placed at these points, the electrostatic 
potential would serve as a useful discriminant function #(x) = ^(/;A^(x,X;), with K being 
typically a radial function (as classical electrostatic potentials are). Potentials such as the 
gaussian have also been used. GRBF (and RBF) may be used to give a more rigorous 
foundation to these two rather heuristic methods. 

GRBF has also similarities with the class of Memory-Based Reasoning methods, recently 
used by D. Waltz and coworkers on massively parallel machines, since in its simplest version 
(as many centers as examples) it is essentially a look-up table that finds those past instances 
that are sufficiently close to the new input. 

7 Gaussian GRBFs and science-fiction neurobiology 

In this section we point out some remarkable properties of gaussian Generalized Radial Basis 
Functions, that may have implications for neurobiology, for VLSI hardware implementations 
and, from a conceptual point of view, for extending in interesting directions the GRBF 
approach. 

7.1 Factorizable Radial Basis Functions 

The synthesis of radial basis functions in high dimensions may be easier if they are factor- 
izable. It can be easily proven that the only radial basis function which is factorizable is 
the Gaussian. A multidimensional gaussian function can be represented as the product of 
lower dimensional gaussians. For instance a 2D gaussian radial function centered in t can 
be written as: 

G(||x-t|| 2 ) = e-H^H 2 = e -(*-*-) 2 e -(*-*») 2 . (36) 
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This dimensionality factorization is especially attractive from the physiological point of 
view, since it is difficult to imagine how neurons could compute &(||x — t a || ) in a simple way 
for dimensions higher than two. The scheme of figure 5, on the other hand, is physiologically 
plausible. Gaussian radial functions in one and two dimensions can be readily implemented 
as receptive fields by weighted connections from the sensor arrays (or some retinotopic array 
of units representing with their activity the position of features). 




Figure 5: A three-dimensional radial gaussian implemented by multiplying two-dimensional 
gaussian and one- dimensional gaussian receptive fields. The latter two functions are syn- 
thesized directly by appropriately weighted connections from the sensor arrays, as neural 
receptive fields are usually thought to arise. Notice that they transduce the implicit position 
of stimuli in the sensor array into a number (the activity of the unit). They thus serve the 
dual purpose of providing the required "number" representation from the activity of the sen- 
sor array and of computing a gaussian function. 2D gaussians acting on a retinotopic map 
can be regarded as representing 2D "features", while the radial basis function represents the 
"template" resulting from the conjunction of those lower-dimensional features. 

Physiological speculations aside, this scheme has three interesting features from the point 
of view of a hardware implementation and also in purely conceptual terms. Consider the 
example of a GRBF network operating on images: 
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1. the multidimensional radial functions are synthesized directly by appropriately weighted 
connections from the sensor arrays, without any need of an explicit computation of the 
norm and the exponential. 

2. 2D gaussians operating on the sensor array or on a retinotopic array of features ex- 
tracted by some preprocessing transduce the implicit position of features in the array 
into a number (the activity of the unit). They thus serve the purpose of providing the 
required "number" representation from the "array" representation. 

3. 2D gaussians acting on a retinotopic map can be regarded as representing 2D "fea- 
tures", while each radial basis function represents the "template" resulting from the 
conjunction of those lower-dimensional features. Notice that in this analogy the radial 
basis function is the AND of several features and could also include the negation of 
certain features, that is the AND NOT of them. The scheme is also hierarchical, in the 
sense that a multidimensional gaussian "template" unit may be a "feature" input for 
another radial function (again because of the factorization property of the gaussian). 
Of course a whole GRBF network may be one of the inputs to another GRBF network. 

7.2 Style of Computation and Physiological Predictions 

The multiplication operation required by the previous interpretation of gaussian GRBFs to 
perform the "conjunction" of gaussian receptive fields is not too implausible from a biophys- 
ical point of view. It could be performed by several biophysical mechanisms (see Koch and 
Poggio, 1987). Here we mention three mechanisms: 

1. inhibition of the silent type and related circuitry (see Torre and Poggio, 1978; Poggio 
and Torre, 1978) 

2. the AND-like mechanism of NMDA receptors 

3. a logarithmic transformation, followed by summation, followed by exponentiation. The 
logarithmic and exponential characteristic could be implemented in appropriate ranges 
by the sigmoid-like pre-to-postsynaptic voltage transduction of many synapses. 

If the first or the second mechanism are used, the product of figure 5 can be performed 
directly on the dendritic tree of the neuron representing the corresponding radial function 
(alternatively, each dendritic tree may perform pairwise products only, in which case a log- 
arithmic number of cells would be required). The GRBF scheme also requires a certain 
amount of memory per basis unit, in order to store the center vector. In the gaussian case 
the center vector is effectively stored in the position of the 2D (or ID) receptive fields and 
in their connections to the product unit(s). This is plausible physiologically. The update 
equations are probably not. Equation (28) or a somewhat similar, quasi-hebbian scheme is 
not too unlikely and may require only a small amount of plausible neural circuitry. Equation 
(29) seems more difficult to implement for a network of real neurons (as an aside, notice that 
in the gaussian case the term /^(||x; — t a || 2 ) has an interesting form !). It should be stressed, 
however, that the centers may be moved in other ways - or not at all! In the gaussian case, 
with basis functions synthesized through the product of gaussian receptive fields, moving the 
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centers means establishing or erasing connections to the product unit. This can be done on 
the basis of rules that are different from the full equation 29, such as, for instance, compet- 
itive learning, and that are biologically more plausible. Computer experiments are needed 
to assess the efficacy of different strategies to learn the optimal position of the centers. 

The GRBF method with the Gaussian suggests an intriguing metaphor for a compu- 
tational strategy that the brain may use, in some cases. Computation, in the sense of 
generalization from examples, would be done by superposition of receptive fields in a multi- 
dimensional input space. In the case of Gaussian radial basis functions, the multidimensional 
receptive fields could be synthesized by combining lower dimensional receptive fields, pos- 
sibly in multiple stages. From this point of view, some cells would correspond to radial 
functions with centers in a high dimensional input space, somewhat similar to prototypes or 
coarse "grandmother cells", a picture that seems superficially consistent with physiological 
evidence. They could be synthesized as the conjunction of gaussian weighted positive and 
negative features in 2D retinotopic arrays. 

Notice that from this perspective the computation is performed by gaussian receptive 
fields and their combination (through some approximation to multiplication), rather than 
by threshold functions. The basis units may not even need to be all radial, as we discuss in 
the next section. The view is in the spirit of the key role that the concept of receptive field has 
always played in neurophsyiology. It predicts the existence of low-dimensional feature-like 
cells and multidimensional Gaussian-like receptive fields, somewhat similar to template-like 
cells, a fact that could be tested experimentally on cortical cells. 3 

7.3 An example: recognizing a 3D object from its perspective 
views 

To illustrate the previous remarks let us consider the following specific task in model-based 
visual recognition. 

A set of visible points on a 3D object Pi, P 2 ,- • • , P n maps under perspective projec- 
tion onto corresponding points on the image plane, whose coordinates will be indicated 
by (xi,j/i), (x 2 , J/2)r ** (^n, ]Jn)- The set of these coordinates represents a view. To each 
view j, obtained by a 3D rigid transformation of the object, we associate the vector v- 7 = 
(xj, 2/1,^2, lAi ' ' ' -> x n-> Vn) in a 2n-dimensional space. Following Basri and Ullman (1989), 
who had considered the simpler case of orthographic projection plus scaling (they prove 
the surprising result that any view can be obtained as the linear combination of a small 
number of other appropriate views), we would like to synthesize a system that takes any 
view of the specific object as input and provides a standard view v as output. The view 
of a different object would lead to something different from v . Is the task feasible? The 
following argument shows that it is. So called structure-from-motion theorems prove that 
for a moving rigid object as few as 2 perspective views and a small number of corresponding 



3 We conjecture that Andersen's data, modeled by Zipser and Andersen (1988) with a backpropagation 
network, may be better accounted for by a GRBF network. In the latter scheme (see previous discussion) 
the radial functions are the product of 2D gaussians representing the visual receptive fields and ID gaussians 
representing the eye position. This accounts immediately for the multiplicative property of the V7 cells 
found by Andersen. Their activities are then superimposed to obtain the desired mapping into head-centered 
coordinates. 
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points are sufficient for recovering the motion and the 3D structure of the object (the avail- 
able results, without a numerical stability analysis and under rather weak conditions, are: 8 
points (Longuet-Higgins, 1981) , 5 points (quoted by Longuet-Higgins, 1981), 7 points (Tsai 
and Huang, 1982) ). Thus a small number of perspective views of an object contains the 
necessary information for computing any perspective view of the same object (provided the 
points of interest are always visible). 

We can use GRBF to achieve this goal, in the following way. We have (see figure 5) 2n 
inputs to accommodate the vectors Vj and, say K radial basis functions initially centered 
each on one of a subset of the M views used to "learn" the system (M > A). Instead of 
one output only as shown in figure 5 the system will have 2n outputs corresponding to the 
components of the standard view v . Thus for each of the M inputs in the training set the 
desired output is v . In the simple case of "fixed centers" the network must attempt to 
satisfy 

V = GC 

where Vq is the (M, 27V) matrix whose rows are all equal to the standard view, C is the 
(K,2N) matrix of the coefficients c and G is the (M, A ) matrix G(||vj — t a || 2 ). The best 
solution in the least square sense is 

C = G+V . 

Computer simulations show that the GRBF network generalizes successfully to views that 
are not part of the training set (Poggio and Edelman, 1990). We are presently exploring 
several issues, such as how performance degrades with decreasing number of views and of 
radial basis functions. 

In general it is better to have as few centers as possible and move them by using equations 
26 and 27. Notice that each center corresponds to a view or to some "intermediate" view. 
Also notice that the required multidimensional gaussians can be synthesized by the product 
of two-dimensional gaussian receptive fields, looking in the retinotopic space of the features 
corresponding to P 1; P 2 , • * * ? Pn- There would be a two-dimensional gaussian receptive field 
for each view in the training set and for each feature P ? , centered at Pi (in the simple case of 
fixed centers), for a total of TV two-dimensional gaussian receptive fields for each of the M 
centers. We have synthesized a yes/no detector for a specific object, irrespectively of its 3D 
position, by subtracting from the output of the GRBF network the selected standard view, 
then taking the euclidean norm of the resulting error vector and thresholding it appropriately. 
All of this assumes that the correspondence problem between views is solved, a quite difficult 
proposition! 

Finally, we cannot resist the temptation of mentioning that the GRBF recognition scheme 
we have outlined has intriguing similarities with some of the data about visual neurons 
responding to faces obtained by Perrett and coworkers (see Perrett et al., 1987, Poggio and 
Edelman, 1989). 
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7.4 Gaussian GRBFs, Coarse Coding and Product Units 

A popular approach in the neural network literature is coarse coding. This technique assigns 
a receptive field to each computing unit. If the dimensions of the receptive fields are properly 
chosen, a point in the input space generally belongs to a number of different receptive fields, 
and is then encoded in a distributed way. An example of coarse coding is the "scalarized" 
representation adopted by Saund (1987) to perform dimensionality reduction by means of a 
network with one hidden layer. In his work a scalar value is represented by the pattern of 
activity over a set of units. The pattern of activity is determined by sampling a gaussian-like 
function G centered on the scalar value itself. If ti, ..., t n are the points at which the function 
G is sampled the "scalarized" representation of the value x is then the following: 

x => {G(x — ti), G(x — t 2 ), ..., G(x — t n )} . 

Once the input is represented in such a way, it is further processed by the hidden and the 
output layer, for instance of a BP network. If the input consists of more than one variable, 
each variable is "scalarized" separately. 

Clearly, the previous sections show that a gaussian GRBF network (with fixed centers) 
is equivalent to coarse coding followed by product units (also called Sigma-Pi units, see 
Rumelhart et al. 1986; Mel, 1988). An example is shown in figure 6 for a two dimensional 
input. The first layer represents the coarse coding stage with two-dimensional gaussian 
receptive fields. The second layer consist of product units that synthesize the gaussian 
radial basis functions. The output of these units is then summed with the weights Ci, c 2 and 
c 3 to give the value of the GRBF expansion. 

In this example the output of a product unit with inputs (xi, x 2 , ..., x n ) was simply 
y = nr=i x i- This is a particular case of the Product Units introduced by Durbin and 
Rumelhart (1989). The output of a Product Unit is in general y = Yli=i X T where the pi are 
exponents that have to be chosen. Substituting the units in the hidden layer with Products 
Units one obtains 

(e-(*-*i) 2 ) Pl ( e -(^) 2 ) P2 = e -M^) 2 +M^) 2 ] , i = 1,2,3 . (37) 

This gives an expansion on a set of non radial basis functions that can also be derived from 
regularization (see section 8.2). 

8 Some Future Extensions and Applications 

This section sketches some of the extensions and applications of GRBFs that will be discussed 
in detail in a forthcoming paper. 

8.1 The Bayes Approach to GRBFs 

Earlier in the paper we have shown that the regularization principle is strictly related to 
the prior assumptions on the mapping to be learned and to the noise distribution in the 
examples (section 3.3). Because of the "equivalence" between GRBFs and regularization, it 
follows that the form of the radial basis function - its shape and its parameters, such as scale 
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F(X, Y) 



Figure 6: A GRBF network with three knots, which can be used for the reconstruction of 
a two dimensional function. The first layer consists of Gaussian units, and implements 
one dimensional receptive fields. The second layer consists of product units and is used to 
recover the two-dimensional receptive field. The output is a weighted sum of the outputs of 
the product units. 
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- depends on the prior, that is on the a priori knowledge of the smoothness properties of 
the mapping to be learned. The prior is in turn related to the complexity of the hypothesis. 
This point of view seems quite powerful both as an interpretation of GRBFs and as an 
approach for generalizing the GRBF scheme in several directions including, for instance, 
dimensionality regularization (Girosi and Poggio, 1989). Another important generalization 
has to do with the type of a priori knowledge that is assumed. The approach of this paper 
deals only with assumptions about smoothness of the mapping to be learned. Assumptions 
of this type are quite weak - in terms of required a priori knowledge - but powerful - in 
terms of their implications (see Stone, 1982). It is natural to ask whether it is possible to 
extend the approach of this paper to other types of prior assumptions. 

8.2 Generalizing GRBFs to HyperBFs 

The GRBF network that we have derived consists of radial function all of the same type. 
In the gaussian case all units have the same a which is set by the stabilizer, that is by the 
prior assumptions about the function to be learned. It is natural to attempt to write a more 
general expansion than equation (19) in terms of, say, gaussians with a depending on t a , 
replacing a by o a . Gradient descent could then be performed on cr a , in addition to t a and c a . 
Moody and Darken (1989) have reported successful testing of a method somewhat similar 
to radial basis functions in which the a of the gaussian units is determined from the data 
and is different from unit to unit. Somewhat surprisingly, it seems very difficult to derive 
rigorously from regularization an expansion of this type. 4 

We found, however, a different solution to the multiscale problem that is consistent with 
regularization and, at the same time, generalizes the GRBF scheme beyond radial functions. 
We call the resulting expansions and the associated networks Hyper Basis Functions. In this 
section we sketch the main results. 

The main idea is to consider the mapping to be approximated as the sum of several 
functions, each one with its own prior, that is stabilizer. The corresponding regularization 
principle then yields a superposition of different Green's functions, in particular gaussians 
with different a. Thus, instead of doing gradient descent on o a the network has a "repertoire" 
of different a at the same positions and chooses which ones to use through the corresponding 
coefficient. In more detail, the function / to be approximated is regarded as the sum of p 
components / m , m = 1, . . . , j>, each component having a different prior probability. Therefore 
the functional H[f] to minimize will contain p stabilizers P m and will be written as 

N p v 

H[f] = j2(Y,f m (^)-y>) 2 +Y,U\P m f m \\ 2 • (38) 

i=l m = l m = l 

From the structure of equation (38) it is clear that the exact solution will be a linear super- 
position of linear superpositions of the Green's functions G m corresponding to the stabilizers 
pm rpj^ a pp roac } 1 f section 5.2 - of using less radial functions than examples (with centers 



4 In principle, optimal cr(x) could be found as the solution to a variational problem in cr that contains 
the standard regularization problem: find the <r, and then the stabilizer, such that the solution of the 
regularization problem is optimal. The problem is related to the optimization problems involving MRFs 
with line processes (Geiger and Girosi, 1989), and to cross-validation techniques (Craven and Wahba, 1979; 
Wahba, 1977). 
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generally different from the examples) - can be naturally extended to deal with this class 
of functionals (Girosi and Poggio, 1989a). The approximated solution to the minimization 
problem defined by equation (38) is written as (the HyperBF expansion) 

F(x)=tt c :G>;C). (39) 

m = l a = l 

In particular, this method leads to radial basis functions of multiple scales for the reconstruc- 
tion of the function /. Suppose we know a priori that the function to be approximated has 
components on a number p of scales <Ti, . . . , o v \ we can use this information to choose a set 
of p stabilizers whose Green's functions are, for example, Gaussians of variance <Ti, . . . , o v . 
According to example 1 of section (5.1.2) we have: 



to - -^ n 



P m / m || 2 = V< / dx(D"7 m (x)) 2 (40) 



2k 

where a™ — -^. As a result, the solution will be a superposition of superpositions of 
gaussians of different variance. Of course, the gaussians with large a should be preset, 
depending on the nature of the problem, to be fewer and therefore on a sparser grid, than 
the gaussians with a small a . 

The method yields also non-radial Green's functions - by using appropriate stabilizers - 
and also Green's functions with a lower dimensionality - by using the associated f m and P m 
in a suitable lower-dimensional subspace. Again this reflects a priori information that may 
be available about the nature of the mapping to be learned. In the latter case the information 
is that the mapping is of lower dimensionality or has lower dimensional components (see the 
problem of Saund, 1987). 

Algorithms to perform gradient descent on the expansion equation (39) are the same 
that can be used for GRBFs (see for instance equations (26) and (27)). The additional 
"repulsive" term in the functional H[f] to be minimized, introduced earlier, should now 
consist of the sum of pairwise interactions among centers of the same type only, for instance 
among gaussians with the same a. In equation 39 two identical basis functions with the 
same center are clearly redundant, but two different basis functions in the same position, 
for instance two gaussians with a different scale, may actually make sense. This amounts to 
a form of an exclusion principle for basis units of the same type (the index m in equation 
(40), see Girosi and Poggio, 1989a). 

Notice that an efficient heuristic scheme may want to create basis units in regions of the 
input space where the network is not performing well and annihilate basis units that are not 
activated much. This common sense heuristics fits perfectly within the formal framework: 
creation of a new unit means moving the center of an existing, remote one to an useful initial 
position; annihilation is equivalent to a value for the associated coefficient c. 

One class of more efficient algorithms than simple gradient descent are multigrid tech- 
niques. Multiresolution techniques have been used in many fields such as numerical analysis 
(Brandt, 1977) and computer vision (Rosenfeld and Kak, 1982). The idea is to find a solu- 
tion at a coarse scale, then use this coarse solution as a starting point for higher resolution 
descriptions that can then be used to refine the coarse levels in an iterative procedure that 
proceeds from coarse to fine and back. Multigrid methods for solving partial differential 
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equations work in this general way. The same basic idea could be used with multiple-scales 
HyperBF: a coarse description is first obtained by doing gradient descent only on the few 
gaussian functions with large a . A finer description is then derived by approximating the 
difference between the data and the coarse approximation with more and smaller gaussian 
functions (the finer description may be used only in a subregion of the input space, some- 
what similarly to a "fovea"). The whole process is then iterated again in a sequence of 
error correction steps, from coarse to fine and from fine to coarse. The process should be 
equivalent to doing gradient descent on the all set of basis functions at once, but it could be 
computationally more efficient and even more capable of avoiding local minima (depending 
on the problem). Ideas of a similar flavor have been recently suggested by Moody (1989). 

8.3 Nonlinear input and output coding 

We have mentioned already that the HyperBF scheme when used for classification tasks 
could include an invertible nonlinear function a of the output without affecting the basic 
technique. Clearly a similar nonlinear function a could be applied to each of the inputs: the 
HyperBF approximation would be then performed on the transformed inputs to yield the o~ x 
transformation of the given output. It seems possible that in some cases suitable input and 
output processing of this type may be advantageous. Is there a general reason for it? Poggio 
(1982), following a forceful argument by Resnikoff (1975), has argued that the input and 
the output of the mapping to be approximated should be processed by a nonlinear function 
in order to match the domain and the range of the approximating function. Resnikoff had 
proposed as nonlinear functions for this processing the birational functions, the exponential 
function, the logarithmic function and the composition of this functions, since they achieve 
the necessary conversion of domain and range with minimal disruption of the algebraic 
structure of the input and output spaces. Input and output coding of this type tries to 
linearize the approximation as much as possible by exploiting a priori information about 
the range and the domain of the mapping to be approximated. Interestingly, the sigmoid 
function used at the output of many neural networks can be derived from the composition 
of a rational function and an exponential and matches the range of functions used for binary 
classification. 

8.4 Learning Dynamical Systems 

HyperBF can be used to "learn" a dynamical system from the time course of its output. 
In fact, RBF have been suggested often as a good technique for this problem and have 
been successfully tested in some cases (Broomhead and Lowe, 1988; Casdagli, 1989). The 
technique involves the approximation of the "iterated map" underlying the dynamical system 
(the crucial problem is, of course, the estimation of the dimension of the attractor and the 
choice of the input variables, see Farmer et al., (1983) for a good review). We have every 
reason to believe that HyperBF will perform on this problem at least as well as the linear 
techniques of Farmer and Sidorowich (1988) and the backpropagation algorithm of Lapedes 
and Farber (1988). 
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8.5 Learning the Dynamical System Underlying a Complex Map- 
ping 

Dynamical systems (in particular discrete time systems, that is difference equations) are 
Turing universal (the game "Life" is an example that has been demonstrated to be Turing 
universal). Thus a dynamical system such as the feedback system shown in figure 7, where 
(x , £1,..., x n ) are the signal values at time t , £1, . . . , t n , is equivalent to a finite state ma- 
chine. Clearly the mapping f(x ] x\ . . . x n ) which is iterated by the feedback system can be 
approximated by HyperBFs. This offers the possibility of "learning" a computation more 
complex than a boolean function, if examples with sufficient information are provided. 

In the following, consider the simpler case of learning a mapping F between an input 
vector x and an output vector y = F(x) that belong to the same n-dimensional space. The 
input and the output can be thought as the asymptotic states of the discrete dynamical 
system obtained iterating some map /. In several cases, the dynamical system that asymp- 
totically performs the mapping may have a much simpler structure than the direct mapping 
F. In other words, it is possible that the mapping / such that 

limf {n \x) = F(x) (41) 

is much simpler than the mapping y = F(x) (here f^ n \x) means the n-th iterate of the map 

/)■ 

A concrete example is the cooperative stereo algorithm (Marr and Poggio, 1976) that 
maps a 3D set of possible matches into the 3D set of true matches. The HyperBF technique 
can then be used to approximate / rather than F. Each update is done after iterating 
equation (41) with a "large enough" n, by using the obvious generalization of equation (28) 
and (29) (see Girosi and Poggio, 1989a). 

8.6 Multilayer networks 

In a somewhat similar spirit to the previous section, multilayer radial basis function networks 
(with more than one hidden layer) may be used for functions that happen to be well repre- 
sentable as the composition of functions that have a simple radial basis functions expansion. 
There is no reason to believe, however, that such "multilayer" functions represent a large 
and interesting class. 

Especially with backpropagation networks, researchers have often argued for several layers 
of hidden units. From the point of view of HyperBF, one layer is needed (the basis functions 
themselves), but there is no obvious sense in additional layers. We believe that this is true 
in general for single networks. 

On the other hand, there is a very good reason for parallel and hierarchical systems 
consisting, for instance, of several HyperBF modules connected together: the first network, 
for instance, may synthesize features that are used by one or more other modules for different, 
specific tasks. 
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Figure 7: A HyperBF network used to approximate a mapping from (x , #i, x 2 ) to (j/ , J/i, J/2) 
by approximating the dynamical system that maps asymptotically the input into the output. 
The input is given and the feedback network is iterated a sufficient number of times. Then 
the HyperBF update equations are used. The same procedure is used for new examples. 
The goal is to approximate the dynamical system that gives asymptotically the desired 
input-output relation. 
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8.7 Learning Perceptual and Motor tasks 

HyperBFs have a good chance of being capable of synthesizing several vision algorithms from 
examples, since (a) it is known that several problems in vision have satisfactory solutions in 
terms of regularization and (b) HyperBF networks are essentially equivalent to regularization. 
We suggest that the use of HyperBFs is not restricted to sensory processes and that they may 
also be used to learn motor tasks (Mel, 1988 has demonstrated a scheme for learning a simple 
motor task which can be considered a special case of GRBFs) and even to model biological 
motor control. In support of this latter point, notice that simple biological trajectory control 
seems to be well explained by variational formulations of the regularization type (Flash and 
Hogan, 1985). HyperBF networks are equivalent to regularization and may have attractive 
neural interpretations: basis functions, possibly radial, may correspond to motor units with 
a multidimensional motor field, whereas their sum may be implicitly performed by the whole 
mechanical system, say a multijoint arm. HyperBFs may also be used for simple forms of 
trajectory planning. Potential methods, which associate to obstacles appropriate repulsive 
forces and obtain in this way an overall potential field that can be used in driving the 
trajectory, have been used with some success (for instance by Ahuja, 1989). These methods 
are the motor analog of the potential methods used for pattern recognition (see section 6.6). 
Clearly, HyperBFs seem naturally suited to learn these fields from sets of examples. In 
any case, computer experiments will have to be performed to assess the performance of the 
HyperBF approach. 

8.8 Learning Input Features 

The theoretical framework we have developed in this paper does not say anything about 
what is probably the most difficult problem in learning a mapping, that is the choice of 
the input representation. In the HyperBF scheme, with gaussian basis functions, gradient 
descent on the centers positions and on the coefficients c a can be regarded as equivalent 
to learning which features to combine in prototypes by selecting the useful combinations. 
The problem of the elementary features, however, is outside the regularization framework, 
which is at the basis of HyperBF. Substantial extensions of our approach, for instance using 
the Bayes point of view, or different approaches, such as genetic-like algorithms, are clearly 
required. 

9 Conclusions 

9.1 How HyperBFs really work 

HyperBF have a rather simple structure that seems to capture some of the main lessons 
that are becoming evident in the fields of statistics and neural networks. HyperBF can be 
regarded as a process involving three coupled stages: 

• a stage which finds centers of clusters of training examples in the associated n-dimensional 
space 

• the coding of the inputs by the K basis functions 
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• a stage that performs an optimal linear mapping from the K-dimensional space of basis 
functions into the desired output vector 

The first stage can be regarded as similar to simple and efficient clustering algorithms. 
The third stage is classical: optimal linear mappings by themselves work well in many situ- 
ations. Our approach to HyperBF shows that this sequence of stages is not just attractive 
heuristics but derives rigorously from regularization and is thereby solidly grounded in ap- 
proximation theory. The theory says that the basis functions provide a sufficient nonlinear 
coding of the input for linear approximation to work (see Poggio, 1982). In addition, it says 
how to couple these various stages in a single procedure. 

To have a feeling of how HyperBFs work let us consider a specific, extreme case, in 
which we consider a HyperBFs network as a classifier, something the formal theory does not 
actually allow. Imagine using a HyperBF scheme to classify patterns, such as handwritten 
digits, in different classes. Assume that the input is a binary a 8-bit vector of length TV 
and each of the basis functions is initially centered on the point in the N-dimensional input 
space that corresponds to one of the training examples (fixed centers case). The system has 
several outputs, each corresponding to one of the digit classes. Let us consider a series of 
special cases of HyperBF of increasing generality: 

1. Each of the unit (its center corresponds to an example) is an hypersphere (see equation 
(34)) and is connected, with weight 1, to its output class only. Classification is done 
by reading out the class with maximum output. In this case, the system is performing 
a Parzen window estimate of the posterior probability and then using a MAP crite- 
rion. The Parzen-window approach is similar (and asymptotically equivalent) to the 
k n nearest-neighbor estimation, of which the nearest-neighbor rule is a special case. In 
this special case the network is equivalent to a hypersphere classifier. 

2. We now replace the hypersphere by a multidimensional gaussian that is an allowed ra- 
dial basis function (the hypersphere does not satisfy Micchelli's condition and cannot 
be derived from regularization). At least for the task of approximating smooth func- 
tions the network should perform better than in the non-gaussian case. The centers of 
the radial basis functions may be regarded as representing "templates" against which 
the input vectors are matched (think, for instance of a radial gaussian with small <r, 
centered on its center, which is a point in the n-dimensional space of inputs). 

3. We may do even better by allowing arbitrary c values between the radial units and the 
output. The c can then be found by the pseudoinverse technique (or gradient descent) 
and are guaranteed to be optimal in the l 2 sense. 

4. We now allow a number of (movable) centers with radial basis functions. This is the 
GRBF scheme. Moving a center is equivalent to modifying the corresponding template. 
Thus equation (29) attempts to develop better templates by modifying during training 
the existing ones. In our example, this means changing the pixel values in the arrays 
representing the digits. 

5. Finally the most general network contains radial units of the gaussian type of different 
scale (i.e. cr), together with non-radial units associated to appropriate stabilizers and 
units that receive only a subset of the inputs. 
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This list shows that the HyperBF scheme is an extension of some of the simplest and 
most efficient approximation and learning algorithms which can be regarded as special cases 
of it. In addition, it illuminates a few interesting aspects of the HyperBF algorithm, such as 
its massive parallelism and its use of prototypes. The network is massively parallel in the 
sense that it may in general require a large number of basis units. While this property could 
have been regarded quite negatively a few years ago, this is not so anymore. The advent 
of parallel machines such as the Connection Machine with about 65,000 processors and of 
special purpose parallel hardware has changed the perspective towards massive parallelism. 
The use of prototypes by HyperBFs suggest that, in a sense, HyperBFs networks are an 
extension of massive template matchers or look-up tables. We believe that this property 
makes them intriguingly attractive: after all, if memory is cheap, look-up tables are a good 
starting point. The HyperBF scheme says how to extend look-up tables to do as well as 
possible in terms of approximating a multidimensional function about which very little is 
known. 

9.2 Networks and Learning: The Pervasive Problem of Dimen- 
sionality 

Our main result shows that the class of feedforward multilayer networks identical to Hy- 
perBF are equivalent to generalized splines, and are capable of well approximating a smooth 
function. This is highly satisfactory from a theoretical point of view, but in practice another 
fundamental question must also be addressed: how many samples are needed to achieve a 
given degree of accuracy (Stone, 1982; Barron and Barron, 1988)? It is well known that 
the answer depends on the dimensionality d and on the degree of smoothness p of the class 
of functions that has to be approximated (Lorentz, 1966, 1986; Stone, 1982, 1985). This 
problem has been extensively studied and some fundamental results have been obtained by 
Stone (1982). He considered a class of nonparametric estimation problems, like surface ap- 
proximation, and computed the optimal rate of convergence e n , that is a measure of how 
accurately a function can be approximated knowing n samples of its graph. He showed 
that using a local polynomial regression the optimal rate of convergence e n = n~ 2 ^+ d can be 
achieved, generalizing previous results based on local averages. This means that the number 
of examples needed to approximate a function reasonably well grows enormously with the 
dimension d of the space on which it is defined, although this effect is mitigated by a high 
degree of smoothness p (in fact e n depends only on the ratio -). For instance in the case of a 
twice differentiable function of two variables, 8000 examples are needed to obtain e n = 0.05, 
but if the function depends on 10 variables the number of examples necessary to obtain the 
same rate of convergence grows up to 10 9 . However, if a function of 10 variables is 10 times 
differentiable 8000 examples will be enough to obtain e n = 0.05. 

Interestingly these results lead to an a posteriori justification of the smoothness assump- 
tion that plays a central role in standard regularization. In fact, when the number of di- 
mensions becomes larger than 3 or 4, one is forced to assume a high degree of smoothness: 
otherwise the number of examples required will be so large that the approximation task 
becomes hopeless. This justifies the use of high-order stabilizers such as the gaussian (the 
stabilizer is equivalent to the prior, see earlier) for high dimensional spaces, whenever no 
other type of prior information about the function to be approximated is available. Notice 
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that in the case of an infinite degree of smoothness the optimal rate of convergence tends 
to an asymptotic value. In fact it is straightforward to check that linip^oo e n = n~2 . In this 
case it is possible to obtain e n = 0.05 with just 400 examples. It must be noticed however 
that the set of functions with infinite degree of smoothness is a very small subset of the space 
of the continuous functions. 

The results stated above are optimal, but it is not guaranteed that they are achievable 
with HyperBF. The similarity of HyperBF with splines is encouraging, however, since splines 
has been proved to be optimal in this sense (Cox, 1984). Our results also suggest that most 
of the networks proposed in the recent literature, since they are similar to the HyperBF nets, 
are likely to have the same fundamental limitations, perhaps even more strongly, from the 
point of view of sample complexity. 

Other interesting results have been obtained by Baum and Haussler on the statistical reli- 
ability of networks for binary classification (Baum, 1988; Baum and Haussler, 1989). They 
use the concept of Vapnik-Chervonenkis dimension (Vapnik and Chervonenkis, 1971) in the 
network context to give the probability of error on new data given the error on the train- 
ing data. This approach is different from the one pursued in approximation and regression 
theory, since they do not estimate a priori the accuracy of the network. These results do 
not directly apply to our case, but, since the concept of Vapnik-Chervonenkis dimension 
has been shown to be very powerful, we think it will also be relevant in the context of the 
HyperBF method. 

Of course this analysis requires that the true dimension of the data set is known, which is 
not always the case. Especially when the number of dimensions is large, not all of them are 
relevant: the problem of "dimensionality reduction" is how to find the relevant dimensions. 
A solution to the problem consists in considering the dimension as another random variable 
that has to be estimated from data (see Girosi and Poggio, 1989). A priori knowledge about 
the number of dimensions can be embedded in the MAP estimator, to make, for instance, 
low dimensionality solutions more likely than others. 

Another approach to dimensionality reduction has been pursued by J. Schwartz (1988), and 
has been shown to be not very different from ours (Girosi and Poggio, 1989). He solves 
the learning problem for many data sets, obtained from the original one dropping some 
dimensions, and then selects the one that gives the best result. This method is more similar 
to Generalized Cross Validation (Wahba, 1977; Craven and Wahba, 1979) and even without 
a priori information on the dimensionality of the problem, turned out to be effective in 
computer simulations (Schwartz, 1988). 

9.3 Summary 

Approaching the problem of learning in networks from the point of view of approximation 
theory provides several useful insights. It illuminates what network architectures are doing; 
it suggests more principled ways of obtaining the same results and ways of extending further 
the approach; and finally, it suggests fundamental limitations of all approximation methods, 
including so called neural networks. 

In this paper, we developed a theoretical framework based on regularization techniques 
that led to a class of three-layer networks, useful for approximation, that we call Generalized 
Radial Basis Functions (GRBF), since they are closely related to the well-known Radial 
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Basis Functions, mainly used for strict interpolation tasks. We have introduced several new 
extensions of the method and its connections with splines, regularization, Bayes formulation 
and clustering. GRBFs have a direct interpretation in terms of a feedforward, multilayer 
network architecture. Unlike neural networks they have good theoretical foundations. They 
may provide the best framework within which we can study general issues for learning tech- 
niques of the neural network type. 
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A Kolmogorov's and Vitushkin's Theorems 

In this section we discuss the problem of the exact representation of continuous multivariate 
functions by superposition of univariate ones. Some results on this topic will be reported as 
well their interpretation in the framework of the multilayer networks. 

This problem is the thirteenth of the 23 problems that Hilbert formulated in his famous 
lecture at the International Conference of Mathematicians in Paris, 1900. Although his orig- 
inal formulation dealt with properties of the solution of a seventh degree algebraic equation 
this problem can be restated as follows: 

prove that there are continuous functions of three variables, not representable as super- 
positions of continuous functions of two variables 

The definition of superposition of functions can be found in Vitushkin (1954), but is quite 
cumbersome and to our aims is sufficient to define it as the usual composition. For example 
the function xy is the superposition of the functions g(-) — exp(-) and h(-) — log(-), since 
we can write 

xy = e {l09X+l09y) = g(h(x) + %)). 

In 1957 Kolmogorov and Arnol'd showed the conjecture of Hilbert to be false: by Kol- 
mogorov 's theorem any continuous function of several variables can represented by means 
of a superposition of continuous functions of a single variable and the operation of addition. 
The original statement of Kolmogorov (1957) is the following: 

Theorem A.l There exist fixed increasing continuous functions h pq (x), on I — [0, 1] so that 
each continuous function f on I n can be written in the form 

2n+l n 

f(x U ...,X n )= J2 3 q (J2 h Pq( X p))i 
q = l p=l 

where g q are properly chosen continuous functions of one variable. 

Since the original formulation of Kolmogorov many authors have improved the repre- 
sentation of theorem A.l. The main results are concerned with the possibility of replacing 
the functions g q by a single function g (Lorentz, 1962) and of writing h pq as l p h q (Sprecher, 
1964). Here we report a formulation due to Kahane (1975) whose proof does not need the 
construction of the functions h q , relying instead on the Baire's theorem. We first give some 
definitions. We will say that a statement is true for quasi every element of a complete met- 
ric space M if it is true on the intersection of a countable family of open sets which are 
everywhere dense in M. Let H be the space with uniform norm consisting of all functions 
continuous and non decreasing on the segment / and H k — H X ... X H the k-th power of 
the space H.The following theorem then holds: 

Theorem A. 2 Let l p (p — l,...,n) be a collection of rationally independent constants. Then 
for quasi every collection {/fci, ..., &2n+i} £ H 2n+1 it is true that any function f £ C(I n ) can 
be represented on I n in the form 
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2n+l n 

q=l p=l 



where g is a continuous function. 



X 




Figure 8: The network implementation of Kahane 7 s version of Kolmogorov 7 s theorem 

The representation of this formula has been graphically depicted in figure 8: it is evident 
the connection between this superposition scheme and a multilayer network architecture. 
In this framework this result could seem very appealing : the representation of a function 
requires a fixed number of nodes, smoothly increasing with the dimension of the input 
space. Unfortunately further studies on this subject showed that these results are somewhat 
pathological and their practical implications very limited. The problem lies in the inner 
functions of Kolmogorov's formula: although they are continuous, some results of Henkin 
and Vitushkin (1967) prove that they must be highly non-smooth. Since the number of binary 
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digits required to code a function with a given accuracy is inversally proportional to its degree 
of smoothness, as a result a very poor accuracy will be obtained implementing Kolmogorov's 
formula with finite precision. One could ask if it is possible to find a superposition scheme 
in which the functions involved are smooth. The answer is negative, even for two variable 
functions, and was given by Vitushkin with the following theorem (1954): 

Theorem A. 3 There are r (r — 1,2, ...) times continuously differentiate functions ofn > 2 
variables, not representable by superposition of r times continuously differ entiable functions 
of less than n variables; there are r times continuously differ entiable functions of two variables 
which are not representable by sums and continuously differ entiable functions of one variable. 



We notice that the intuition underlying Hilbert's conjecture and theorem A. 3 is the same: 
not all the functions with a given degree of complexity can be represented in simple way by 
means of functions with a lower degree of complexity. The reason for the failing of Hilbert's 
conjecture is a "wrong" definition of complexity: Kolmogorov's theorem shows that the 
number of variables is not sufficient to characterize the complexity of a function. Vitushkin 
showed that such a characterization is possible and gave an explicit formula. Let / be a 
r times continuously differentiable function defined on I n with all its partial derivatives of 
order r belonging to the class Lip[0, l] a . Vitushkin puts 

r + a 

X = 

n 
and shows that it can be used to measure the inverse of the complexity of a class of functions. 
In fact he succeded in proving the following: 



£0. 
/ 

by superpositions of functions of characteristic x — f > Xo, 9 ^ 1- 



Theorem A. 4 Not all functions of a given characteristic \o — f 2- > can be represented 



k 
Theorem A. 3 is easily derived from this result. 

B Linear Approximation (1-Layer Networks) 

Let us assume that the mapping between a set of input vectors y and a set of output vectors 
z is linear (for an example, see Hurlbert and Poggio, 1988). How can we estimate the linear 
mapping from a set of examples? We start by arranging the sets of vectors in two matrices 
Y and Z. The problem of synthesizing the linear operator L is then equivalent to "solving" 
the following equation for L: 

Z = LY (42) 

A general solution to this problem is given by 

L = ZF+, (43) 

where Y + is the pseudoinverse of Y. This is the solution which is most robust against errors, 
if equation (42) admits several solutions and it is the optimal solution in the least-squares 
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sense, if no exact solution of equation (42) exists. This latter case is the one of interest to 
us: in order to overconstrain the problem, and so avoid look-up table solutions, we require 
that the number of examples (columns of Y) be larger than the rank of the matrix L. In 
this case, there is no exact solution of equation (42) and the matrix L is chosen instead to 
minimize the expression 

M = \\LY - Z\\ 2 . (44) 

L may be computed directly by minimizing (44), which yields 

L = ZY T (YY T y 1 (45) 

In practice, we compute L using equation (45), but first regularize it by adding a stabi- 
lizing functional to obviate problems of numerical stability (Tikhonov and Arsenin, 1977). 

B.l Recursive Estimation of L 

It is of particular importance for practical applications that the pseudoinverse can be com- 
puted in an adaptive way by updating it when new data become available (Albert, 1972). 
Consider again equation (45). Assume that the matrix Y consists of n — 1 input vectors and 
Z of the corresponding correct outputs. We rewrite equation (45) as 

L n . x = Z n _ x Y+_ x (46) 

If another input-output pair y n and z n becomes available, we can compute L n recursively by 

L n = L n _ X + (Z n - ^n-\VnY T a , (47) 

where 

provided that (l / ^_il / ^ 1 )~ 1 exists (i.e., that the number of columns in Y is greater than or 
equal to the dimension of y). The case in which (l / ^_il / ^ 1 )~ 1 does not exist is discussed 
together with more general results in Albert (1972). Note that (z n — L n _iy n ) in the updating 
equation (47) is the error between the desired output and the predicted one, in terms of the 
current L. The coefficient t n is the weight of the correction: with the value given by equation 
(48) the correction is optimal and cannot be improved by any iteration without new data. 
A different value of the coefficient is suboptimal but may be used to converge to the optimal 
solution by successive iterations of equation (48) using the same data. 
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B.2 Optimal Linear Estimation, Regression and Bayesian Esti- 
mation 



The optimal linear estimation scheme we have described is closely related to a special case of 
Bayesian estimation in which the best linear unbiased estimator (BLUE) is found. Consider 
equation (42): the problem is to construct an operator L that provides the best estimation 
Ly of z. We assume that the vectors y and z are sample sequences of gaussian stochastic 
processes with, for simplicity, zero mean. Under these conditions the processes are fully 
specified by their correlation functions 

E[yy T ] = C yy , E[zy T ] = C zy (49) 

where E indicates the expected value. The BLUE of z (see Albert, 1972) is, given j/, 

z est = CzyC^y, (50) 

which is to be compared with the regression equation 

Ly = ZY T (YY T )- 1 y. (51) 

The quantities ZY T and YY T are approximations to C zy and C yy ^ respectively, since the 
quantities are estimated over a finite number of observations (the training examples). Thus 
there is a direct relation between BLUEs and optimal linear estimation. The learned operator 
captures the stochastic regularities of the input and output signals. Note that if the input 
vectors y are orthonormal, then L = ZY T and the problem reduces to constructing a simple 
correlation memory of the holographic type (see Poggio, 1975). Under no restrictions on the 
vectors j/, the correlation matrix ZY T may still be considered as a low-order approximation 
to the optimal operator (see Kohonen, 1978). 

C Polynomial Approximation 

A natural extension of linear estimation is polynomial estimation. The Weierstrass-Stone 
theorem suggests that polynomial mappings, of which linear mappings are a special case, 
can approximate arbitrarily well all continuous real functions. There are function space 
equivalents of Weierstrass' theorem. In our case, defining X • X • X (n times) as the set of 
linear combinations of the monomials of degree n in the components (Xi, . . . , Xk) of X, we 
simply look for the multivariate polynomial given by L(X) — L + L\X + L 2 X • X + L 3 X • 
X • X + . . . that minimizes \\y — L(X)\\. The problem is equivalent to finding the optimal 
linear estimator on the "expanded" space consisting of all the products of the components 
of X. The monomials such as Xi,XiX 2 , or X^X^Xs • • • become the new features that are 
input to an optimal linear estimator. The resulting polynomial estimator can approximate 
any continuous mapping. In the case of optimal polynomial estimation, the crucial problem 
is computational complexity, which translates into the computation of the pseudoinverse of 
very large matrices y t (see equation (43)). As a (small) step toward simplifying this problem, 
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one can think of computing the optimal Z , then the optimal L\ connection, and so on in a 
sequence, up to L n . Poggio (1975) proposes that the optimal polynomial estimator of order 
n can be found by an iterative procedure that first find the optimal linear estimator, then the 
optimal second-order correction up to the optimal n-order correction and then back again 
to the linear correction and so on. A theorem (Poggio, 1975) guarantees convergence of the 
procedure to the optimal n-th order estimator. 

Recently, it has been demonstrated that a multilayer network can represent exactly any 
polynomial mapping, given polynomial output functions of at least order two for each unit 
and enough hidden units and layers (Moore and Poggio, 1988). 

D The Relation between Regularization and Bayes 
Estimation 

The problem of hypersurface reconstruction - and therefore of learning - can be formulated 
as a Bayesian estimation problem. The goal is to estimate the a posteriori probability P z /d 
of the solution z given the data d and use the estimate according to a chosen performance 
criterion . Bayes theorem yields 

P(z/d) oc P(z)P(d/z) 

where P(z) is the prior probability density of the process associated with z and P(dj z) is 
the conditional probability density of the data d given the hypersurface z. 

We consider now the special case of z (or equivalently the result of the action of any 
differential operator P on it) being a gaussian process. In this case, the a priori probability 
distribution of z is 

P(z) oc e -K^) 



where (•,•) is a scalar product in the space to whom z belongs and P is the adjoint of 
the operator P. Let us assume that the noise process affecting the data d taken from z is 
additive, white and gaussian with variance a 2 . Then the conditional probability P(dj z) can 
be written as 



P(d/z) oc e 2a2 



2_IU_^l|2 



d\\ 2 



where || • || is the norm induced by the scalar product (•,•). Depending on the optimality 
criterion there are now several ways of obtaining the best estimate of z given the data d. A 
commonly used estimate is the Maximum A Posteriori (MAP) estimate 

^(^best/^) = max {^P(^/^)k £ Z} 
In our case the following holds 

\ z hestl ) ~ maxe 2a 
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Since by definition (z,PPz) — ||P^|| 2 , this is equivalent to finding the solution z that 
minimizes the functional 

\\z-d\\ 2 + \\\Pz\\ 2 

that, in the case of sparse data, becomes the functional of equation (1). 

This sketch of the relation between the Bayesian approach and regularization shows that 
the first term in equation 1, Y^i \\ z i~di\\ 2 ^ is — logP^/z, in other words it represents the known 
model of the measurement process or, equivalently, the model of the noise. The second term, 
||P^|| 2 , is —logP z and therefore is dictated by the prior, that is the a priori knowledge about 
the solution, such as its smoothness properties. 
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